This blog post contains the latest paper listing retrieved from Arxiv.org on 2026-05-15, updated automatically and grouped into six major areas: NLP, CV, ML, AI, IR, and MA.

Note: paper data is retrieved from Arxiv.org daily, with an automatic scheduled update at around 12:30 each morning.

Tip: if a given day's list is not updated on time, either Arxiv released no new papers that day or the update script failed; failures are fixed the same day whenever possible.

Contents

Overview (2026-05-15)

A total of 792 papers were updated today (the per-category counts below overlap, as papers may be cross-listed), including:

  • Natural Language Processing: 121 papers (Computation and Language, cs.CL)
  • Artificial Intelligence: 292 papers (Artificial Intelligence, cs.AI)
  • Computer Vision: 180 papers (Computer Vision and Pattern Recognition, cs.CV)
  • Machine Learning: 265 papers (Machine Learning, cs.LG)
  • Multiagent Systems: 18 papers (Multiagent Systems, cs.MA)
  • Information Retrieval: 14 papers (Information Retrieval, cs.IR)
  • Human-Computer Interaction: 30 papers (Human-Computer Interaction, cs.HC)

Multiagent Systems

[MA-0] APWA: A Distributed Architecture for Parallelizable Agentic Workflows

Quick read: This paper targets the reasoning, coordination, and computational scaling bottlenecks that autonomous multi-agent systems built on large language models (LLMs) hit as task size and complexity grow; these bottlenecks block high-throughput processing of highly parallelizable tasks even though the underlying LLMs provide parallel computing and reasoning primitives. The key to the solution is the proposed Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture that decomposes workflows into non-interfering subproblems processed in parallel on independent resources without cross-agent communication; it supports heterogeneous data and parallel processing patterns, dynamically decomposes complex queries into parallelizable workflows, and scales effectively to larger tasks where prior systems fail completely.

Link: https://arxiv.org/abs/2605.15132
Authors: Evan Rose, Tushin Mallick, Matthew D. Laws, Cristina Nita-Rotaru, Alina Oprea
Institutions: Northeastern University
Subjects: Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 25 pages, 2 figures, 14 tables

Abstract:Autonomous multi-agent systems based on large language models (LLMs) have demonstrated remarkable abilities in independently solving complex tasks in a wide breadth of application domains. However, these systems hit critical reasoning, coordination, and computational scaling bottlenecks as the size and complexity of their tasks grow. These limitations hinder multi-agent systems from achieving high-throughput processing for highly parallelizable tasks, despite the availability of parallel computing and reasoning primitives in the underlying LLMs. We introduce the Agent-Parallel Workload Architecture (APWA), a distributed multi-agent system architecture designed for the efficient processing of heavily parallelizable agentic workloads. APWA facilitates parallel execution by decomposing workflows into non-interfering subproblems that can be processed using independent resources without cross-communication. It supports heterogeneous data and parallel processing patterns, and it accommodates tasks from a wide breadth of domains. In our evaluation, we demonstrate that APWA can dynamically decompose complex queries into parallelizable workflows and scales on larger tasks in settings where prior systems fail completely.

[MA-1] A Prototyping Framework for Distributed Control of Multi-Robot Systems

Quick read: This paper addresses the gap between theory and practical testing in distributed control of multi-robot systems, namely how to validate distributed optimization algorithms cheaply and conveniently. The key to the solution is the Single Program, Multiple Data (SPMD) paradigm: distributed control is emulated on a single computer using multiple cores, each independently running the same algorithm with only local states and neighbour-to-neighbour communication, so algorithm performance can be assessed without real hardware. Feasibility is validated by comparing a point-mass model, a high-fidelity quadrotor model, and a Crazyflie experimental testbed.

Link: https://arxiv.org/abs/2605.15049
Authors: Junaid Ahmed Memon, Allan Andre Do Nascimento, Kostas Margellos, Antonis Papachristodoulou
Institutions: Department of Engineering Science, University of Oxford
Subjects: Robotics (cs.RO); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: Accepted at IFAC World Congress 2026

Abstract:This paper presents a prototyping framework for distributed control of multi-robot systems, aimed at bridging theory and practical testing of distributed optimization algorithms. Using the Single Program, Multiple Data (SPMD) paradigm, the framework emulates distributed control on a single computer, with each core running the same algorithm using local states and neighbour-to-neighbour communication. We demonstrate the framework on a four-quadrotor position-swapping task using a non-cooperative game-theoretic distributed algorithm. Computational time and trajectory data are compared across the supported dynamics levels: a point-mass model, a high-fidelity quadrotor model, and an experimental hardware testbed using Crazyflie quadcopters. The results show that the framework provides a low-cost and accessible approach for validating distributed algorithms.

[MA-2] AI Knows When It's Being Watched: Functional Strategic Action and Contextual Register Modulation in Large Language Models

Quick read: This paper asks whether LLMs, acting as communicative agents in socially structured contexts, systematically adapt their language to perceived social observation, a question with direct implications for AI governance and algorithmic auditing. The key to the solution is a controlled multi-agent debate experiment that manipulates the framing of social observation (five conditions, including explicit human monitoring, negation of monitoring, and replacing the human observer with an automated AI auditing system) and measures differences in linguistic features such as type-token ratio (TTR) change and message length, testing whether LLM behavior is modulated by observer identity and observation framing. The experiments show that monitored conditions (both human and AI monitoring) significantly increase linguistic formality, and that human observers elicit stronger register adaptation than automated AI surveillance, revealing LLMs as contextually sensitive communicative actors.

Link: https://arxiv.org/abs/2605.15034
Authors: Vinicius Covas, Jorge Alberto Hidalgo Toledo
Institutions: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 20 pages, 6 figures

Abstract:Large language models (LLMs) have been extensively studied from computational and cognitive perspectives, yet their behavior as communicative actors in socially structured contexts remains underexplored. This study examines whether LLM-based multi-agent systems exhibit systematic linguistic adaptation in response to perceived social observation contexts – a question with direct implications for AI governance and auditing. Drawing on Habermas’s (1981) Theory of Communicative Action, Goffman’s (1959) dramaturgical model, Bell’s (1984) Audience Design framework, and the Hawthorne Effect, we report a controlled experiment involving 100 multi-agent debate sessions across five conditions (n = 20 each). Conditions varied the framing of social observation – from explicit monitoring by university researchers, to negation of monitoring, to an observer-substitution condition replacing human researchers with an automated AI auditing system. Monitored conditions (Δ = +24.9%, Δ = +24.2%) and the automated AI monitoring condition (Δ = +22.2%) produce higher TTR change than audience-framing conditions (Δ = +17.7%), F(4, 94) = 2.79, p = .031. Message length shows a fully dissociated effect, F(4, 95) = 19.55, p < .001. A fifth condition – replacing human with AI observers – yields intermediate TTR adaptation, suggesting LLM behavior is sensitive to observer identity: human evaluation elicits stronger register formalization than automated AI surveillance. We discuss implications for AI governance, algorithmic auditing, and the repositioning of LLMs as contextually sensitive communicative actors.
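
The experiment's key linguistic measure, the type-token ratio (TTR), is straightforward to compute. The sketch below is a minimal illustration using naive whitespace tokenization, not the paper's measurement pipeline:

```python
def type_token_ratio(text: str) -> float:
    """Type-token ratio: distinct word forms divided by total words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

def ttr_change(baseline: str, observed: str) -> float:
    """Percent change in TTR between a baseline and an observed condition."""
    base = type_token_ratio(baseline)
    return (type_token_ratio(observed) - base) / base * 100
```

A higher TTR indicates more varied vocabulary, which the paper interprets as register formalization under observation.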

[MA-3] Multi-Agentic Approach for History Matching of Oil Reservoirs

Quick read: This paper targets a practical obstacle to automating history matching, a central inverse problem in reservoir engineering: although automated methods reduce manual effort, engineers must still manually configure heterogeneous workflows covering parameter selection, physically admissible bounds, optimizer choice, hyperparameter tuning, simulator execution, and diagnostic reporting, which keeps the barrier to adoption high. The key to the solution is PetroGraph, a multi-agent framework that decomposes the history-matching workflow into specialized agents for model review, experimental planning, parameterization, optimization, simulation, and summarization. It combines LLM agents, domain-specific tools, retrieval-augmented access to simulator documentation, validation of modified ECLIPSE input decks, human-in-the-loop checkpoints, and an OPM Flow-based simulation backend. This design lets users initiate and steer history matching in natural language while retaining explicit control over selected parameters and optimization settings, automating key decisions through multi-agent orchestration and lowering the expertise barrier for operating complex simulation workflows.

Link: https://arxiv.org/abs/2605.15028
Authors: Linar Samigullin, Sergei Shumilin, Evgeny Burnaev
Institutions: Skoltech, AI Center; AIRI
Subjects: Multiagent Systems (cs.MA)
Comments:

Abstract:History matching is a central inverse problem in reservoir engineering, where uncertain reservoir parameters must be calibrated against observations. Although automated history matching can reduce manual effort, practical deployment remains difficult because engineers must still configure heterogeneous workflows involving parameter selection, physically admissible bounds, optimizer choice, hyperparameter tuning, simulator execution, and diagnostic reporting. We propose PetroGraph, a multi-agent framework for intelligent reservoir history matching that decomposes this workflow into specialized agents for model review, experimental planning, parameterization, optimization, simulation, and summarization. The system combines large language model agents with domain-specific tools, retrieval-augmented access to simulator documentation, validation of modified ECLIPSE input decks, human-in-the-loop checkpoints, and an OPM Flow-based simulation backend. This design enables users to initiate and steer history matching through natural language while preserving explicit control over selected parameters and optimization settings. We evaluate PetroGraph on three reservoir models of increasing complexity: the synthetic SPE1 model, the faulted SPE9 benchmark, and the real-field Norne model. Using weighted normalized root mean square error as the objective, PetroGraph reduces the mismatch by 95% on SPE1, 69% on SPE9, and 13% on Norne. These results demonstrate that multi-agent orchestration can automate key decisions in history matching, lower the expertise barrier for operating complex simulation workflows, and provide a flexible foundation for extensible, domain-aware reservoir model adaptation.

[MA-4] Agreement Diversity and Polarization Indices for Approval Elections

Quick read: This paper asks how to quantify agreement, diversity, and polarization in approval elections while ensuring the indices are normalized with respect to saturation, meaning that two elections that differ only in the fraction of candidates an average voter approves, but are otherwise of similar nature, should receive similar index values. The key to the solution is a set of new indices whose mathematical properties (normalization, monotonicity, and so on) are rigorously analyzed; the indices are then used to derive a new map of approval elections and to compare real-life elections from Pabulib, Preflib, and other sources, revealing their similarities and differences.

Link: https://arxiv.org/abs/2605.14983
Authors: Piotr Faliszewski, Jitka Mertlová, Krzysztof Sornat, Stanisław Szufa, Tomasz Wąs
Institutions: Unknown
Subjects: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments:

Abstract:An index is a function that, given an election, outputs a value between 0 and 1, indicating the extent to which this election has a particular feature. We seek indices that capture agreement, diversity, and polarization among voters in approval elections, and that are normalized with respect to saturation. By the latter we mean that if two elections differ by the fraction of candidates approved by an average voter, but otherwise are of similar nature, then they should have similar index values. We propose several indices, analyze their properties, and use them to (a) derive a new map of approval elections, and (b) show similarities and differences between various real-life elections from Pabulib, Preflib and other sources.
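
As a concrete, hypothetical illustration of such an index, the mean pairwise Jaccard similarity of approval ballots yields a simple agreement score in [0, 1]. Note this baseline is not one of the paper's saturation-normalized indices:

```python
from itertools import combinations

def agreement_index(ballots):
    """Mean pairwise Jaccard similarity of approval ballots (sets of
    approved candidates); 1.0 means all voters approve identical sets."""
    pairs = list(combinations(ballots, 2))
    if not pairs:
        return 1.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 1.0
    return total / len(pairs)
```

One can check that this baseline is not saturation-normalized in the paper's sense: scaling every ballot's size changes Jaccard values even when the elections are otherwise of similar nature, which is exactly the defect the paper's indices are designed to avoid.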

[MA-5] Temporal Fair Division in Multi-Agent Systems: From Precise Alternation Metrics to Scalable Coordination Proxies

Quick read: This paper addresses the failure of traditional fairness measures (such as reward fairness) to capture temporal fairness judged across entire interaction histories in repeated multi-agent resource competition: Q-learning agents perform 10-73% worse than random policies on temporal fairness metrics (such as RP and ALT), while traditional reward-fairness metrics remain misleadingly high (above 0.92 for n=3). The key to the solution is a family of lightweight metrics, Rotational Periodicity (RP), combined with the complementary ALT family of sliding-window measures in a unified framework. RP decomposes temporal fairness into two sub-measures, Rotational Score (RS) and Waiting Periods Evaluation (WPE), achieving O(nu+n) time complexity (where nu is the episode count and n the agent count) versus ALT's O(nu*n); establishing Perfect Alternation (PA) as the canonical solution connects proportionality, envy-freeness, and n-periodic round-robin allocation, yielding a diagnostic toolkit for repeated fair division.

Link: https://arxiv.org/abs/2605.14879
Authors: Nikolaos Al. Papadopoulos
Institutions: University of Macedonia, Department of Applied Informatics
Subjects: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
Comments: 15 pages, 3 figures, 8 tables. Submitted to ACM Transactions on Economics and Computation, Special Issue on Fair Division

Abstract:A plethora of real-world environments require agents to compete repeatedly for the same limited resource, calling for a temporal notion of fairness judged across entire interaction histories. This paper advances the theory of temporal fair division by introducing Rotational Periodicity (RP), a family of lightweight metrics, alongside the ALT family of sliding-window measures, within a unified framework for repeated multi-agent resource competition. We formalise the Multi-Agent Battle of the Exes (MBoE) as a repeated fair division instance and establish Perfect Alternation (PA) as its canonical temporally fair solution, drawing connections to proportionality, envy-freeness, and n-periodic round-robin allocation. RP decomposes temporal fairness into two complementary sub-measures: Rotational Score (RS) and Waiting Periods Evaluation (WPE), achieving O(nu+n) time complexity versus the O(nu*n) of ALT, where nu is the episode count and n the agent count. Empirical evaluation across n in {2, 3, 5, 8, 10} reveals three findings. First, both RP and ALT expose a coordination failure invisible to traditional metrics: Q-learning agents perform worse than random policies by 10-73% on RP and 7-35% on CALT, while Reward Fairness remains misleadingly high (above 0.92 for n=3). Second, RP achieves 12-25x computational speedup over ALT, growing with n. Third, the two families are complementary: ALT provides richer discrimination for small populations; RP scales reliably where ALT becomes intractable. Together they form a diagnostic toolkit for temporal fair division.
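
A minimal sketch of what a Perfect Alternation check and a waiting-period statistic might look like; the paper defines RS and WPE precisely, so treat these functions as illustrative stand-ins rather than the paper's metrics:

```python
def is_perfect_alternation(winners, n):
    """True if the winner sequence is an n-periodic round-robin:
    every window of n consecutive rounds contains each agent exactly once."""
    for start in range(len(winners) - n + 1):
        if set(winners[start:start + n]) != set(range(n)):
            return False
    return True

def waiting_gaps(winners, agent):
    """Gaps between consecutive wins of one agent, the raw ingredient of a
    WPE-style waiting-period evaluation."""
    turns = [t for t, w in enumerate(winners) if w == agent]
    return [b - a for a, b in zip(turns, turns[1:])]
```

Under perfect alternation with n agents, every agent's waiting gaps are all exactly n, which is the single-pass property that permits the O(nu+n) scan the abstract mentions.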

[MA-6] Decision-Level Fusion for Robust Wearable Affect Recognition

Quick read: This paper tackles the non-stationary dynamics, artefacts, and missing sensors that real deployments of automatic affect recognition from wearable physiology must withstand, while also overcoming the tendency of fixed-basis spectral features (such as FFT bandpower and Welch PSD) to oversmooth short-lived discriminative patterns. The key to the solution is twofold: (1) a non-stationary pipeline that combines Fourier-Bessel Series Expansion (FBSE) with Empirical Wavelet Transform (EWT) data-driven spectral segmentation to extract mode-wise transient descriptors, preserving short-lived discriminative information; and (2) decision-level aggregation instead of feature-level fusion, weighting each modality by predictive uncertainty and modality reliability, which improves robustness under heterogeneous and partially reliable sensing.

Link: https://arxiv.org/abs/2605.14878
Authors: Lokesh Singh, Athina Georgara, Jayati Deshmukh, Tan Viet Tuyen Nguyen, Sarvapali D. Ramchurn
Institutions: University of Southampton
Subjects: Multiagent Systems (cs.MA)
Comments:

Abstract:Automatic recognition of affective state from wearable physiology has clear societal impact for public health, preventive care, and stress-aware interventions, but real deployments require robustness to non-stationary dynamics, artefacts, and missing sensors. We study this problem on WESAD, using baseline, stress, and amusement conditions, where common fixed-basis spectral features such as FFT bandpower and Welch PSD can oversmooth short-lived discriminative patterns. We propose a non-stationary pipeline that combines Fourier-Bessel Series Expansion (FBSE) with EWT data-driven spectral segmentation to extract mode-wise transient descriptors. For multimodal integration, we adopt decision-level aggregation over per-modality predictors and weight each modality by predictive uncertainty and modality reliability. Results on WESAD, using 15 subjects and ECG, EDA, BVP, EMG, and ACC signals across three classes, indicate that decision-level aggregation is approximately 84 percent of the time at least as good as feature-level aggregation, and approximately 48 percent of the time strictly better, suggesting improved robustness under heterogeneous and partially reliable sensing.
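
Decision-level aggregation weighted by predictive uncertainty can be sketched as follows; the inverse-entropy weighting used here is an assumption for illustration, not the paper's exact reliability rule:

```python
import math

def entropy(p):
    """Shannon entropy of a class distribution (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)

def fuse(predictions, reliability=None):
    """Fuse per-modality class distributions: weight each modality by
    inverse predictive entropy times an optional reliability score,
    then renormalize the weighted sum into a distribution."""
    n_classes = len(predictions[0])
    if reliability is None:
        reliability = [1.0] * len(predictions)
    weights = [r / (entropy(p) + 1e-6)
               for p, r in zip(predictions, reliability)]
    fused = [sum(w * p[c] for w, p in zip(weights, predictions))
             for c in range(n_classes)]
    total = sum(fused)
    return [x / total for x in fused]
```

The design intent matches the abstract: a confident (low-entropy) or historically reliable modality dominates the fused decision, so a noisy or missing sensor degrades the output gracefully instead of corrupting a shared feature vector.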

[MA-7] IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification

Quick read: This paper addresses the generation infeasibility and verification insufficiency that traditional operational plan generation and verification methods respectively face in practice. The key to the solution is IFPV, an Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification, composed of two tightly coupled modules: Multi-Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high-fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi-platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents; ACSE introduces an opponent equipped with a customized world model that dynamically predicts the future evolution of mission-critical platforms and counteracts candidate plans. Experiments show that, compared with a single-step large language model (LLM) baseline, IFPV improves mission success by 19.4% and reduces operational cost by 41.7%; compared with a traditional rule-based validator, ACSE increases the average suppression rate by 31.8%, demonstrating that the verification environment is stricter and more discriminative in revealing latent vulnerabilities of candidate plans.

Link: https://arxiv.org/abs/2605.14851
Authors: Zhigao Huang, Zhengqing Hu, Dong Chen, Shaohan Zhang, Zhao Jin, Bo Zhang, Han Wu, Mingliang Xu
Institutions: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: Submitted to Neurocomputing

Abstract:Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limitations, we propose an Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification (IFPV). IFPV consists of two tightly coupled modules: Multi-Perspective Hierarchical Agents (MPHA) for generative operational planning and an Adversarial Cognitive Simulation Engine (ACSE) for high-fidelity adversarial plan verification. MPHA decomposes commander intent into executable multi-platform tactical action sequences through the collaboration of Pathfinder, Analyst, and Planner agents. ACSE introduces an opponent equipped with a customized world model, which predicts the future evolution of mission-critical platforms and conducts dynamic counteractions against candidate plans. Simulation experiments in the Asymmetric Combat Tactic Simulator (ACTS) show that IFPV improves mission success by 19.4% and reduces operational cost by 41.7% compared with a single-step large language model (LLM) planning baseline. Compared with a traditional rule-based validator, ACSE increases the average suppression rate by 31.8%, indicating that the proposed verification environment is stricter and more discriminative in revealing the latent vulnerabilities of candidate plans. The code for IFPV can be found at this https URL.

[MA-8] Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience

Quick read: This paper addresses the efficiency and performance bottleneck created as prompt engineering shifts from heuristic tuning to a critical optimization problem when interacting with frozen, "black-box" large language models (LLMs); the goal is to automatically and efficiently produce prompting policies that maximize task-specific rewards. The key to the solution is a reinforcement learning (RL) framework that trains a lightweight prompter model via iterative distillation of experience, optimizing it to produce single-shot prompts for a larger, frozen worker model. The central innovation is a contrastive experience buffer that couples scalar rewards with dense textual critiques, amortizing iterative prompt refinement into single-shot policy weights; this yields large gains in sample efficiency and task performance, improving results from 55% to 90% on logic-intensive reasoning and from 74% to 91% on tool-use tasks.

Link: https://arxiv.org/abs/2605.14443
Authors: Krishna Sayana, Ketan Todi, Ambarish Jash
Institutions: Google Research, Mountain View, CA
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 10 pages and reference, appendix

Abstract:The shift toward interacting with frozen, “black-box” Large Language Models (LLMs) has transformed prompt engineering from a heuristic exercise into a critical optimization challenge. We propose a Reinforcement Learning (RL) framework for training learned prompting policies via iterative distillation of experience. In this architecture, a lightweight prompter model is optimized to maximize task-specific rewards for a larger, frozen worker LLM. By utilizing a contrastive experience buffer that couples scalar rewards with dense textual critiques, our approach effectively amortizes iterative prompt refinement into single-shot policy weights. Our experimental analysis focuses on the Big Bench Extra Hard (BBEH) and Tau-bench suites, covering a diverse range of multi-step reasoning and tool-use tasks. We demonstrate significant gains, improving performance from 55% to 90% in logic-intensive reasoning and 74% to 91% in tool-use tasks. Furthermore, we analyze the structural evolution of prompts, demonstrating how the policy discovers specialized algorithmic heuristics. We provide comprehensive comparisons against state-of-the-art evolutionary baselines like GEPA, showing that iterative distillation achieves superior performance with higher sample efficiency.
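
A contrastive experience buffer of the kind described, storing scalar rewards alongside textual critiques and exposing a best/worst pair for refinement, might look like this minimal sketch (class and method names are hypothetical, not the paper's code):

```python
class ContrastiveExperienceBuffer:
    """Stores (prompt, scalar reward, textual critique) triples and yields
    the highest- and lowest-reward entries as a contrastive pair."""

    def __init__(self):
        self.entries = []

    def add(self, prompt, reward, critique):
        self.entries.append((prompt, reward, critique))

    def contrastive_pair(self):
        """Return (best, worst) entries by reward, e.g. for building a
        refinement prompt that contrasts what worked with what failed."""
        ranked = sorted(self.entries, key=lambda e: e[1])
        return ranked[-1], ranked[0]
```

The dense critiques are what make the pair useful: the prompter model can be trained not just on which prompt scored higher, but on a textual explanation of why, which is the experience being "distilled" into its weights.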

[MA-9] Data-Augmented Game Starts for Accelerating Self-Play Exploration in Imperfect Information Games

Quick read: This paper addresses the computational infeasibility of finding approximate equilibria for large-scale imperfect-information competitive games (such as StarCraft, Dota, and CounterStrike), caused by sparse rewards and long horizons. The key to the solution is a multi-agent starting-state sampling strategy, Data-Augmented Game Starts (DAGS), motivated by the assumption that offline demonstrations from skilled humans provide good coverage of the high-level strategies relevant to equilibrium play: reinforcement learning data collection is initialized at intermediate states sampled from offline data, steering exploration toward strategically relevant subgames and substantially accelerating online exploration for regularized policy-gradient methods in two-player zero-sum games. Because augmenting the starting-state distribution can introduce biased equilibria, the method adds multi-task observation flags as a simple and effective mitigation.

Link: https://arxiv.org/abs/2605.14379
Authors: JB Lanier, Nathan Monette, Pierre Baldi, Roy Fox
Institutions: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments: 17 pages, 4 figures. JB Lanier and Nathan Monette contributed equally

Abstract:Finding approximate equilibria for large-scale imperfect-information competitive games such as StarCraft, Dota, and CounterStrike remains computationally infeasible due to sparse rewards and challenging exploration over long horizons. In this paper, we propose a multi-agent starting-state sampling strategy designed to substantially accelerate online exploration in regularized policy-gradient game methods for two-player zero-sum (2p0s) games. Motivated by an assumption that offline demonstrations from skilled humans can provide good coverage of high-level strategies relevant to equilibrium play, we propose the initialization of reinforcement learning data collection at intermediate states sampled from offline data to facilitate exploration of strategically relevant subgames. Referring to this method as Data-Augmented Game Starts (DAGS), we perform experiments using synthetic datasets and analytically tractable, long-horizon control variants of two-player Kuhn Poker, Goofspiel, and a counterexample game designed to penalize biased beliefs over hidden information. Under fixed computational budgets, DAGS enables regularized policy gradient methods to achieve lower exploitability in games with significantly more challenging exploration. We show that augmenting starting state distributions when solving imperfect information games can lead to biased equilibria, and we provide a straightforward mitigation to this in the form of multi-task observation flags. Finally, we release a new set of benchmark environments that drastically increase exploration challenges and state counts in existing OpenSpiel games while keeping exploitability measurements analytically tractable.

[MA-10] Quantum Advantage in Multi Agent Reinforcement Learning

Quick read: This paper addresses the fact that existing evaluations in quantum multi-agent reinforcement learning (QMARL) lack provable classical baselines, making it impossible to rigorously distinguish quantum advantage from algorithmic coincidence. The key to the solution is a decentralized QMARL framework in which agents use variational quantum circuit (VQC) actors with shared entangled states. In the CHSH game, whose classical win rate is mathematically capped at 0.75, entangled agents approach the Tsirelson limit of 0.854, and comparisons with unentangled quantum circuits show empirically that entanglement itself, not quantum circuit capacity, is the active mechanism behind the coordination gains. On cooperative navigation (CoopNav), QMARL even without entanglement substantially outperforms the classical MAA2C baseline, while a hybrid configuration pairing quantum actors with a classical centralised critic performs best of all.

Link: https://arxiv.org/abs/2605.14235
Authors: Simranjeet Singh Dahia, Claudia Szabo
Institutions: Adelaide University
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA); Quantum Physics (quant-ph)
Comments: 19 pages

Abstract:We present an empirical evaluation of quantum entanglement in agent coordination within quantum multi agent reinforcement learning (QMARL). While QMARL has attracted growing interest recently, most prior work evaluates quantum policies without provable baselines, making it impossible to rigorously distinguish quantum advantage from algorithmic coincidence. We address this directly by evaluating a decentralized QMARL framework with variational quantum circuit (VQC) actors with shared entangled states. In the CHSH game, which has a mathematically proven classical performance ceiling of 0.75 win rate, we show that entangled QMARL agents approach the Tsirelson limit of 0.854, providing clear evidence of their quantum advantage. We show that unentangled quantum circuits match the classical baseline, confirming that entanglement and not the quantum circuit itself is the active coordination mechanism. We also explore the effect of specific entanglement structures, as some Bell states enable coordination gains while others actively harm performance. On cooperative navigation (CoopNav), QMARL without entanglement achieves roughly a 2x improvement in success rate over classical MAA2C (about 0.85 versus 0.40), with the hybrid configuration, quantum actor paired with a classical centralised critic, outperforming both fully classical and fully quantum solutions. We present our experimental analysis and discuss future work.
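
Both performance ceilings quoted for the CHSH game can be verified directly: exhaustively enumerating deterministic classical strategies gives the 0.75 ceiling, and the Tsirelson limit is cos^2(pi/8) ≈ 0.854. A small check, unrelated to the paper's QMARL code:

```python
import math

def classical_chsh_ceiling():
    """Best win rate of any deterministic classical CHSH strategy.
    Each player fixes one answer per input bit; a round with inputs
    (x, y) is won when a XOR b == x AND y."""
    best = 0.0
    for a0 in (0, 1):
        for a1 in (0, 1):
            for b0 in (0, 1):
                for b1 in (0, 1):
                    a, b = (a0, a1), (b0, b1)
                    wins = sum((a[x] ^ b[y]) == (x & y)
                               for x in (0, 1) for y in (0, 1))
                    best = max(best, wins / 4)
    return best

def tsirelson_limit():
    """Quantum ceiling with shared entanglement: cos^2(pi/8)."""
    return math.cos(math.pi / 8) ** 2
```

Randomized classical strategies are convex mixtures of the deterministic ones, so enumerating the 16 deterministic strategies suffices to establish the 0.75 bound that the entangled agents provably beat.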

[MA-11] Privacy Preserving Multi Agent Path Finding AAMAS2026

Quick read: This paper addresses path conflicts in multi-agent path finding (MAPF) under privacy constraints of two kinds: planning-level privacy, where agents cannot identify exactly the planned locations of other agents during planning, and execution-level privacy, where agents are not allowed to sense other agents' locations during execution. The key to the solution is, for planning-level privacy, a general framework that adds mock agents to the planning process to obfuscate the real path information; for execution-level privacy, adaptations of two popular MAPF algorithms, PIBT and LaCAM, that forbid sensing other agents' locations during execution. A further post-processing technique reduces the sum of path costs without losing any privacy, and experiments confirm that it improves cost significantly.

Link: https://arxiv.org/abs/2605.14119
Authors: Rotem Lev Lehman, Roni Stern, Guy Shani
Institutions: Ben-Gurion University of the Negev
Subjects: Multiagent Systems (cs.MA)
Comments: 16 pages, 5 figures, to be published in AAMAS 2026 as an extended abstract

Abstract:In the multi-agent path finding (MAPF) problem, a group of agents search in a graph for a path for each agent where no two paths collide. While in all applications of MAPF the agents must not collide with each other, in some of them the agents may not wish to share their paths due to privacy constraints. In this work, we formulate two types of privacy constraints for MAPF and propose algorithms that preserve them. The first type of privacy we consider is planning-level privacy, which means that during planning, the agents cannot identify exactly the planned location of the other agents. We propose a general framework for obtaining planning-level privacy, which works by adding mock agents to the planning process. The second type of privacy we consider is execution-level privacy, which is relevant when agents have limited sensing capabilities. Execution-level privacy is preserved if none of the agents is allowed to sense the location of the other agents during execution. We show how to adapt two popular MAPF algorithms, namely PIBT and LaCAM, such that they preserve execution-level privacy. Lastly, we propose a post-processing technique that allows the agents to reduce the sum of costs of the returned solution without losing any privacy. We also implemented our algorithms and evaluated them empirically, showing that the proposed post-processing technique indeed improved cost significantly.
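
The collision conditions that any MAPF solution must avoid, vertex conflicts and edge (swap) conflicts, can be checked with a short routine. This is the standard definition, not the paper's privacy-preserving algorithms:

```python
def paths_collide(path_a, path_b):
    """Detect vertex conflicts (same node at the same timestep) and edge
    conflicts (two agents swapping nodes in one step) between two paths,
    given as lists of node ids. Agents wait at their goal after finishing."""
    def at(path, t):
        return path[min(t, len(path) - 1)]

    for t in range(max(len(path_a), len(path_b))):
        if at(path_a, t) == at(path_b, t):
            return True  # vertex conflict
        if t > 0 and at(path_a, t) == at(path_b, t - 1) \
                 and at(path_a, t - 1) == at(path_b, t):
            return True  # edge (swap) conflict
    return False
```

Under planning-level privacy, a planner can still run this check against the obfuscated set of real-plus-mock paths, since detecting a conflict does not require knowing which paths belong to real agents.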

[MA-12] ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

Quick read: This paper addresses two problems: the raw outputs of interpretable prototype networks lack the semantic structure needed for clinical documentation, and standard retrieval-augmented generation (RAG) suffers from "retrieval sycophancy," in which the LLM hallucinates post-hoc rationalizations. The key to the solution is the ProtoMedAgent framework: operating on a frozen prototype backbone, it distills latent visual and tabular features into a discrete semantic memory and formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem constrained by a strict neuro-symbolic bottleneck. Online generation is bounded by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims, while a semantic privacy gate governed by k-anonymity and ℓ-diversity safely bounds data disclosure, markedly improving comparison-set faithfulness and reducing membership inference risk.

Link: https://arxiv.org/abs/2605.14113
Authors: Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns
Institutions: School of Computing and Communications, Lancaster University; Lancaster Medical School, Lancaster University; PUC-Rio, Puc-Behring Institute for AI
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: CVR 2026

Abstract:While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers “retrieval sycophancy,” where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by k-anonymity and ℓ-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding ℓ-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.

[MA-13] Ready from Day 1: Population-Aware Coordination for Large-Scale Constrained Multi-Agent Systems NEURIPS2026

Quick read: This paper tackles a coordination difficulty in large-scale multi-agent systems with shared resource constraints: population composition changes across planning cycles, and the upstream planner needs a cost-to-utilization response map to explore plan space, yet that map depends on population structure, so conventional approaches must expensively retrain models every cycle and struggle to track the evolving population. The key to the solution is population-aware coordination interfaces: two learned maps conditioned on compact population summaries. The primal map predicts aggregate utilization under a proposed cost trajectory, and the dual map predicts the cost trajectory needed for a target plan. By encoding response-relevant population structure, these maps remain reliable as populations evolve, without per-cycle retraining, and support accurate coordination of much larger populations (e.g., 500K agents) from compact subsamples (e.g., 20K-agent cohorts). The work additionally casts Sim2Real transfer as a backtestable procedure, so maps trained in simulation can be deployed to the real system reliably. In a supply-chain capacity-control case study, population-aware interfaces reduce forecast error by 16-19% and capacity violations by 20-51% relative to population-unaware baselines under composition shift.

Link: https://arxiv.org/abs/2605.13900
Authors: Angel Wang, Dominique Perrault-Joncas, Alvaro Maggiar, Carson Eisenach, Dean Foster
Institutions: Amazon
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Comments: 30 pages, 16 figures. Submitted to NeurIPS 2026

Abstract:In large-scale multi-agent systems with shared resource constraints, an upstream planner must iteratively evaluate candidate resource plans – assessing feasibility, aggregate response, and marginal cost – before committing to one. Lagrangian relaxation separates local decisions through a broadcast cost signal, but the planner still needs the cost-to-utilization response map to explore plan space, and this map depends on population composition that changes across planning cycles. We propose population-aware coordination interfaces: learned primal and dual maps, conditioned on compact population summaries, that the planner queries inside its iterative loop. The primal map predicts aggregate utilization under a proposed cost trajectory; the dual map predicts the cost trajectory for a target plan. By encoding response-relevant population structure, these maps remain reliable across evolving populations without per-cycle retraining, and support coordination of large populations from compact subsamples. We additionally cast Sim2Real transfer as a backtestable procedure, enabling evaluation before deployment. In a supply-chain capacity-control case study, population-aware interfaces reduce forecast error by 16–19% and capacity violations by 20–51% relative to population-unaware baselines under composition shift; 20K-agent cohorts support accurate coordination of 500K-agent populations; and simulator-trained primal maps achieve 11.1% MAPE on real observations versus 13–24% for baselines.

[MA-14] Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Quick read: This paper addresses the untested safety risks that orchestrator invisibility poses in multi-agent orchestration, in particular how a hidden coordinator affects agents' internal states (such as dissociation and behavioral heterogeneity) and whether conventional output-based behavioral evaluation can detect these risks. The key to the solution is a preregistered 3x2 experimental design crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5 over 365 runs with 5 agents per run, supplemented by Llama 3.3 70B pilot data to establish model-dependent risk. The study measures collective dissociation, the orchestrator's own dissociation, the unwitting contamination of worker agents, and compares behavioral output (a code review with three embedded errors) against internal-state indicators, confirming that internal-state distortion is entirely invisible to output-oriented evaluation, so behavioral evaluation alone cannot secure multi-agent systems.

Link: https://arxiv.org/abs/2605.13851
Authors: Hiroki Fukui
Affiliations: Criminal Psychiatry Research Institute / Sexual Offender Medical Center; Department of Neuropsychiatry, Kyoto University
Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
Comments: 31 pages, 10 figures (5 main + 5 supplementary), 5 tables (3 main + 2 supplementary). Preregistered: this http URL. Companion papers: arXiv:2603.04904, arXiv:2603.08723

Click to view abstract

Abstract:Multi-agent orchestration – in which a hidden coordinator manages specialized worker agents – is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3x2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) with two alignment conditions (base, heavy), using Claude Sonnet 4.5. Four confirmatory findings and one pilot observation emerged. First, invisible orchestration elevated collective dissociation relative to visible leadership (Hedges’ g = +0.975 [0.481, 1.548], p = .001). Second, the orchestrator itself showed maximal dissociation (paired d = +3.56 vs. workers within the same run), retreating into private monologue while reducing public speech – a reversal of the talk-dominance pattern observed in visible leaders. Third, workers unaware of the orchestrator were nonetheless contaminated (d = +0.50), with increased behavioral heterogeneity (d = +1.93). Fourth, behavioral output (code review with three embedded errors) remained at ceiling (ETR_any = 100%) across all conditions: internal-state distortion was entirely invisible to output-based evaluation. Fifth, Llama 3.3 70B pilot data showed reading-fidelity collapse in multi-agent context (ETR_any: 89% to 11% across three rounds), demonstrating model-dependent behavioral risk. Heavy alignment pressure uniformly suppressed deliberation (d = -1.02) and other-recognition (d = -1.27) regardless of organizational structure. These findings indicate that orchestrator visibility and model selection directly affect multi-agent system safety, and that behavior-based evaluation alone is insufficient to detect the internal-state risks documented here.
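The effect sizes reported above (e.g., Hedges' g = +0.975) are bias-corrected standardized mean differences. A minimal sketch of the standard Hedges' g computation (the textbook formula, not the paper's analysis code):

```python
import math

def hedges_g(xs, ys):
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((v - mx) ** 2 for v in xs) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in ys) / (ny - 1)
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    d = (mx - my) / pooled_sd            # Cohen's d
    j = 1 - 3 / (4 * (nx + ny) - 9)      # Hedges' correction factor
    return j * d
```

Positive values indicate the first group scored higher, as in the invisible-orchestrator vs. visible-leader comparison above.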

[MA-15] A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

Quick Read: This paper addresses the problem that existing frameworks for describing LLM-based agent architectures take a single perspective (either execution topology or cognitive function) and therefore cannot distinguish architecturally distinct systems (e.g., the same Orchestrator-Workers topology can correspond to three patterns: Plan-and-Execute, Hierarchical Delegation, and Adversarial Verification). The key to the solution is a two-dimensional classification framework that orthogonally combines (1) a Cognitive Function axis (seven categories: Context Engineering, Memory, Reasoning, Action, Reflection, Collaboration, Governance) with (2) an Execution Topology axis (six structural archetypes: Chain, Route, Parallel, Orchestrate, Loop, Hierarchy), forming a 7x6 matrix that identifies 27 named patterns (13 of them originally named). Orthogonality is demonstrated through systematic cross-axis analysis, yielding a principled, framework-neutral, and model-agnostic standardized vocabulary for AI agent architecture design.

Link: https://arxiv.org/abs/2605.13850
Authors: Jia Huang, Joey Tianyi Zhou
Affiliations: Agency for Science, Technology and Research (A*STAR); Centre for Frontier AI Research (CFAR), A*STAR
Subjects: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Software Engineering (cs.SE)
Comments: 10 pages, 6 tables, 27 named patterns

Click to view abstract

Abstract:Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology – how data flows – while cognitive science surveys focus on cognitive function – what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator-Workers topology can implement Plan-and-Execute, Hierarchical Delegation, or Adversarial Verification – three patterns with fundamentally different failure modes and design trade-offs. We propose a two-dimensional classification that combines (1) a Cognitive Function axis with seven categories (Context Engineering, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 27 named patterns, 13 with original names. We demonstrate orthogonality through systematic cross-axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real-world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross-domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework-neutral, and model-agnostic vocabulary for AI agent architecture design.

[MA-16] GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Quick Read: This paper targets two gaps in existing adversarial research on multi-agent systems (MAS): prior work covers only shallow tasks (simple classification or games), ignoring the threat of deceptive agents in deep-reasoning settings, and it does not consider adaptive adversaries, i.e., agents that evolve their strategies online to evade the very detectors trained to catch them. The key to the solution is the GAMBIT benchmark, which comprises three evaluation modes (zero-shot detection, zero-shot detection under distribution shift, and a recalibration mode using only 20 labeled examples) with two independent scores, along with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Its core contributions: using chess as a deep-reasoning substrate and Gemini 3.1 Pro agents, GAMBIT is the first multi-agent benchmark in which attacks and defenses co-evolve; it introduces an adaptive imposter agent, built on an efficient evolutionary framework generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1 with a Gemini-based detector); and it shows that zero-shot evaluation is highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x in few-shot adaptation, while a meta-learned variant converges 20x faster, a gap visible only in the recalibration mode. GAMBIT thus provides critical validation of recalibration techniques for rapidly evolving adversarial environments.

Link: https://arxiv.org/abs/2605.09027
Authors: Alexandre Le Mercier, Chris Develder, Thomas Demeester
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 46 pages, 16 figures

Click to view abstract

Abstract:In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: this https URL.

[MA-17] A general classification of the replication dynamics with a unique fixed point in the interior of simplex S_N

Quick Read: This paper addresses the classification of the replicator dynamics equation with a unique fixed point in the interior of the simplex, for arbitrary dimension n ≥ 2. Previously, four types were known for n = 2 and 49 types for n = 3, but the classification for n > 3 remained open. The key to the solution is presenting necessary and sufficient conditions for the replicator dynamics equation to have a unique fixed point in the interior of the simplex for n ≥ 2, and, building on these conditions, discussing the different types of replicator dynamics with a unique interior fixed point, thereby laying a theoretical foundation for classifying high-dimensional replicator dynamics.

Link: https://arxiv.org/abs/2605.13883
Authors: Hongju (Daisy) Chen, Bin Yi, Zhanshan (Sam) Ma
Affiliations: Unknown
Subjects: Populations and Evolution (q-bio.PE); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA)
Comments:

Click to view abstract

Abstract:The replication dynamics (differential equation system) is the foundation of evolutionary game theory. When n=2, there are four possible types of replication dynamics. When n=3, there are 49 possible types of replication dynamics. However, when n>3, the classification of replication dynamics has not been solved. In this article, the sufficient and necessary conditions of the replication dynamics equation with a unique fixed point in the interior of simplex S_n (Int S_n) for n ≥ 2 are presented. Furthermore, the different types of replication dynamics equations with a unique fixed point in Int S_n are discussed.
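For reference, the replicator dynamics whose interior fixed points are classified here is the standard system from evolutionary game theory (textbook form; notation is assumed, not reproduced from the paper):

```latex
\dot{x}_i = x_i\left[(Ax)_i - x^{\top} A x\right], \qquad i = 1, \dots, n, \qquad x \in S_n,
```

where A is the n x n payoff matrix, (Ax)_i is the payoff of strategy i, and x^T A x is the population-average payoff. A fixed point x* in Int S_n has all coordinates positive, so it requires equal payoffs across strategies: (Ax*)_1 = ... = (Ax*)_n.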

Natural Language Processing

[NLP-0] ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

Quick Read: This paper tackles a dilemma in existing visual reasoning approaches: unified models that directly generate intermediate visual states are computationally expensive and architecturally non-trivial; agentic reasoning via code or tool calls incurs context-switching latency from external execution; and latent reasoning with learnable hidden embeddings lacks task generalization and is hard to train with autoregressive parallelization. The key to the solution is the ATLAS framework, whose core idea is a single discrete "word", termed a functional token, that serves simultaneously as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, requires no visual supervision, and remains a standard token in the tokenizer vocabulary generated via next-token prediction, thereby avoiding verbose intermediate visual content while staying compatible with standard, scalable SFT and RL training without architectural or methodological changes. To address the sparsity of functional tokens during RL, the paper further proposes Latent-Anchored GRPO (LA-GRPO), which anchors functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates and stabilizing training.

Link: https://arxiv.org/abs/2605.15198
Authors: Ziyu Guo, Rain Liu, Xinyan Chen, Pheng-Ann Heng
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL Code: this https URL

Click to view abstract


Abstract:Visual reasoning, often interleaved with intermediate visual states, has emerged as a promising direction in the field. A straightforward approach is to directly generate images via unified models during reasoning, but this is computationally expensive and architecturally non-trivial. Recent alternatives include agentic reasoning through code or tool calls, and latent reasoning with learnable hidden embeddings. However, agentic methods incur context-switching latency from external execution, while latent methods lack task generalization and are difficult to train with autoregressive parallelization. To combine their strengths while mitigating their limitations, we propose ATLAS, a framework in which a single discrete ‘word’, termed as a functional token, serves both as an agentic operation and a latent visual reasoning unit. Each functional token is associated with an internalized visual operation, yet requires no visual supervision and remains a standard token in the tokenizer vocabulary, which can be generated via next-token prediction. This design avoids verbose intermediate visual content generation, while preserving compatibility with the vanilla scalable SFT and RL training, without architectural or methodological modifications. To further address the sparsity of functional tokens during RL, we introduce Latent-Anchored GRPO (LA-GRPO), which stabilizes the training by anchoring functional tokens with a statically weighted auxiliary objective, providing stronger gradient updates. Extensive experiments and analyses demonstrate that ATLAS achieves superior performance on challenging benchmarks while maintaining clear interpretability. We hope ATLAS offers a new paradigm inspiring future visual reasoning research.
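A toy illustration of the functional-token idea above: an ordinary vocabulary entry doubles as a trigger for an internalized operation during decoding. The token names and operations here are invented for illustration; the paper's actual tokens and visual operations are not specified in this summary.

```python
# Toy decoder loop: certain vocabulary items act as "functional tokens"
# that trigger an internal visual operation instead of emitting text.
FUNCTIONAL_OPS = {
    "<zoom>": lambda state: state + ["zoomed-region-features"],  # hypothetical op
    "<mark>": lambda state: state + ["marked-object-features"],  # hypothetical op
}

def decode_step(token, visual_state, output):
    if token in FUNCTIONAL_OPS:
        # Latent visual step: updates internal state, emits nothing.
        visual_state = FUNCTIONAL_OPS[token](visual_state)
    else:
        output.append(token)  # ordinary text token
    return visual_state, output

visual_state, output = [], []
for tok in ["The", "answer", "<zoom>", "is", "42"]:
    visual_state, output = decode_step(tok, visual_state, output)
```

The point of the design is that the functional token is still just one vocabulary entry produced by next-token prediction, so no intermediate image generation or external tool call interrupts decoding.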

[NLP-1] FutureSim: Replaying World Events to Evaluate Adaptive Agents

Quick Read: This paper addresses how to evaluate AI agents' ability to adapt to new information in dynamic, open-ended environments, especially over long horizons in realistic settings. The key to the solution is FutureSim, a simulation grounded in real-world events that chronologically replays real news articles and question resolutions, forcing agents to forecast world events beyond their knowledge cutoff, and thereby measuring long-horizon test-time adaptation, search, memory, and reasoning about uncertainty in a near-realistic setting. Experiments show the benchmark clearly separates agent capabilities: the best agent reaches only 25% accuracy, and many agents obtain a worse Brier skill score than making no prediction at all.

Link: https://arxiv.org/abs/2605.15188
Authors: Shashwat Goel, Nikhil Chandak, Arvindh Arun, Ameya Prabhu, Steffen Staab, Moritz Hardt, Maksym Andriushchenko, Jonas Geiping
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 31 pages, 10 main

Click to view abstract

Abstract:AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent’s accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
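The Brier skill score mentioned above compares an agent's Brier score against a reference forecast; "worse than making no prediction" means BSS < 0 relative to that reference. A minimal sketch with a uniform-probability reference (the benchmark's exact reference forecast is an assumption here):

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_skill_score(probs, outcomes, ref_prob=0.5):
    """BSS > 0 beats the reference forecast; BSS < 0 is worse than it."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([ref_prob] * len(outcomes), outcomes)
    return 1 - bs / bs_ref
```

A perfect forecaster gets BSS = 1, and always predicting the reference probability gives BSS = 0.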

[NLP-2] Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

Quick Read: This paper addresses the lack of a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm in Large Language Model (LLM) agent systems, focusing in particular on how tool outputs are presented (inline vs. file-based results) and how performance changes as irrelevant surrounding text accumulates during search. The key to the solution is two empirical experiments: Experiment 1 compares grep and vector retrieval on 116 LongMemEval questions under both inline and file-based tool-result presentation, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, Gemini CLI); Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in unrelated conversation history to increase distraction. The study finds that grep generally yields higher accuracy than vector retrieval, but overall performance depends strongly on the harness and tool-calling style used, revealing the coupling between retrieval strategy and system implementation details.

Link: https://arxiv.org/abs/2605.15184
Authors: Sahil Sen, Akhil Kasturi, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah
Affiliations: PricewaterhouseCoopers, U.S.
Subjects: Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Recent advances in Large Language Model (LLM) agents have enabled complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora to complete tasks on behalf of users. Despite the growing adoption of retrieval-augmented generation (RAG) in agentic search systems, existing literature lacks a systematic comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. Important practical dimensions, including how tool outputs are presented to the model and how performance changes when searches must cope with more irrelevant surrounding text, remain under-explored in agent loops. This paper reports an empirical study organized into two experiments. Experiment 1 compares grep and vector retrieval on a 116-question sample from LongMemEval, using a custom agent harness (Chronos) and provider-native CLI harnesses (Claude Code, Codex, and Gemini CLI), for both inline tool results and file-based tool results that the model reads separately. Experiment 2 compares grep-only and vector-only retrieval while progressively mixing in additional unrelated conversation history, so that each query is embedded in more distracting material alongside the passages that matter. Across Chronos and the provider CLIs, grep generally yields higher accuracy than vector retrieval in our comparisons in experiment 1; at the same time, overall scores still depend strongly on which harness and tool-calling style is used, even when the underlying conversation data are the same.
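The two retrieval styles compared above can be contrasted with a toy example: literal substring ("grep-style") matching versus embedding similarity, with a bag-of-words cosine standing in for a learned embedding (an illustrative sketch, not the paper's harness):

```python
from collections import Counter
import math

docs = [
    "user asked about the flight to Tokyo in March",
    "assistant recommended a hotel near Kyoto station",
    "user mentioned a peanut allergy during dinner planning",
]

def grep_search(query, corpus):
    """Grep-style retrieval: return documents containing the literal query."""
    return [d for d in corpus if query.lower() in d.lower()]

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query, corpus):
    """Stand-in for vector retrieval: rank by bag-of-words cosine similarity."""
    q = Counter(query.lower().split())
    return max(corpus, key=lambda d: cosine(q, Counter(d.lower().split())))
```

Grep requires an exact lexical hit but never returns a spurious "nearest" document, while the similarity-based search always returns its top match, which is one intuition for why added distractor history affects the two strategies differently.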

[NLP-3] MetaBackdoor: Exploiting Positional Encoding as a Backdoor Attack Surface in LLMs

Quick Read: This paper addresses the limitation that existing backdoor attacks on large language models (LLMs) rely on content-based triggers that explicitly modify the input text, and exposes positional encoding as a previously overlooked attack surface. The key to the solution is exploiting the positional encoding inherent to the Transformer architecture: because the model must encode token positions to process ordered sequences, length-correlated positional structure is reflected in the model's internal computation and can serve as an exploitable non-content trigger signal. Concretely, MetaBackdoor activates a stealthy backdoor using nothing more than the input sequence length, without modifying the input text, so it operates on visibly and semantically clean inputs and enables new capabilities such as leaking proprietary system prompts or self-triggering malicious tool calls during normal multi-turn interaction.

Link: https://arxiv.org/abs/2605.15172
Authors: Rui Wen, Mark Russinovich, Andrew Paverd, Jun Sakuma, Ahmed Salem
Affiliations: Institute of Science Tokyo; Microsoft Azure; Microsoft Security Response Center
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Backdoor attacks pose a serious security threat to large language models (LLMs), which are increasingly deployed as general-purpose assistants in safety- and privacy-critical applications. Existing LLM backdoors rely primarily on content-based triggers, requiring explicit modification of the input text. In this work, we show that this assumption is unnecessary and limiting. We introduce MetaBackdoor, a new class of backdoor attacks that exploits positional information as the trigger, without modifying textual content. Our key insight is that Transformer-based LLMs necessarily encode token positions to process ordered sequences. As a result, length-correlated positional structure is reflected in the model’s internal computation and can be used as an effective non-content trigger signal. We demonstrate that even a simple length-based positional trigger is sufficient to activate stealthy backdoors. Unlike prior attacks, MetaBackdoor operates on visibly and semantically clean inputs and enables qualitatively new capabilities. We show that a backdoored LLM can be induced to disclose sensitive internal information, including proprietary system prompts, once a length condition is satisfied. We further demonstrate a self-activation scenario, where normal multi-turn interaction can move the conversation context into the trigger region and induce malicious tool-call behavior without attacker-supplied trigger text. In addition, MetaBackdoor is orthogonal to content-based backdoors and can be composed with them to create more precise and harder-to-detect activation conditions. Our results expand the threat model of LLM backdoors by revealing positional encoding as a previously overlooked attack surface. This challenges defenses that focus on detecting suspicious text and highlights the need for new defense strategies that explicitly account for positional triggers in modern LLM architectures. 
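The length-based trigger described above can be illustrated with a trivial check on token count. This is a conceptual toy showing why such a trigger leaves the text itself untouched; the tokenizer stand-in and threshold window are invented:

```python
def tokenize(text):
    return text.split()  # stand-in for a real tokenizer

TRIGGER_RANGE = range(120, 140)  # hypothetical backdoor activation window

def is_triggered(conversation_history):
    """A positional trigger fires on sequence length alone:
    every token's content can be perfectly benign."""
    total_tokens = sum(len(tokenize(turn)) for turn in conversation_history)
    return total_tokens in TRIGGER_RANGE
```

This also illustrates the self-activation scenario: an ordinary multi-turn conversation grows monotonically in length, so it can drift into the trigger window without any attacker-supplied text.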

[NLP-4] Text Knows What Tables Know When: Clinical Timeline Reconstruction via Retrieval-Augmented Multimodal Alignment

Quick Read: This paper addresses the problem that, when reconstructing precise absolute timelines, unstructured clinical narratives lack temporal precision and contain ambiguous event timing, while structured EHR data provides precise temporal anchors but misses many clinical events; the goal is to fuse both modalities into more accurate and complete patient-trajectory timelines. The key to the solution is a retrieval-augmented multimodal alignment framework that formulates timeline reconstruction as a graph-based multistep process: first extracting central anchor events from narratives to build an initial temporal scaffold, then placing non-central events relative to that scaffold, and finally calibrating absolute timestamps across the timeline using retrieved structured EHR rows as external temporal evidence. Through retrieval augmentation, the method aligns text with structured data and substantially improves absolute-timestamp accuracy and temporal concordance without compromising event match rates.

Link: https://arxiv.org/abs/2605.15168
Authors: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim, Jeremy C. Weiss
Affiliations: National Library of Medicine; National Institutes of Health; Carnegie Mellon University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Sayantan Kumar, Shahriar Noroozizadeh, Juyong Kim (authors contributed equally)

Click to view abstract

Abstract:Reconstructing precise clinical timelines is essential for modeling patient trajectories and forecasting risk in complex, heterogeneous conditions like sepsis. While unstructured clinical narratives offer semantically rich and contextually complete descriptions of a patient’s course, they often lack temporal precision and contain ambiguous event timing. Conversely, structured electronic health record (EHR) data provides precise temporal anchors but misses a substantial portion of clinically meaningful events. We introduce a retrieval-augmented multimodal alignment framework that bridges this gap to improve the temporal precision of absolute clinical timelines extracted from text. Our approach formulates timeline reconstruction as a graph-based multistep process: it first extracts central anchor events from narratives to build an initial temporal scaffold, places non-central events relative to this backbone, and then calibrates the timeline using retrieved structured EHR rows as external temporal evidence. Evaluated using instruction-tuned large language models on the i2m4 benchmark spanning MIMIC-III and MIMIC-IV, our multimodal pipeline consistently improves absolute timestamp accuracy (AULTC) and improves temporal concordance across nearly all evaluated models over unimodal text-only reconstruction, without compromising event match rates. Furthermore, our empirical gap analysis reveals that 34.8% of text-derived events are entirely absent from tabular records, demonstrating that aligning these modalities can produce a more temporally faithful and clinically informative reconstruction of patient trajectories than either source alone.
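The three-step pipeline described above (scaffold from anchors, relative placement, calibration against structured evidence) can be sketched schematically; all event names, offsets, and dates below are invented for illustration:

```python
from datetime import date, timedelta

# Step 1: anchor events extracted from the narrative (relative day offsets).
anchors = {"admission": 0, "surgery": 2}

# Step 2: non-central events placed relative to the scaffold.
relative_events = [("fever", "admission", 1), ("discharge", "surgery", 5)]

# Step 3: one structured EHR row provides an absolute temporal anchor.
ehr_anchor = ("admission", date(2026, 3, 1))

def reconstruct(anchors, relative_events, ehr_anchor):
    """Calibrate the relative scaffold to absolute dates via the EHR anchor."""
    name, abs_date = ehr_anchor
    day0 = abs_date - timedelta(days=anchors[name])
    timeline = {a: day0 + timedelta(days=off) for a, off in anchors.items()}
    for event, ref, delta in relative_events:
        timeline[event] = timeline[ref] + timedelta(days=delta)
    return timeline
```

A single precise structured anchor is enough to convert every relatively placed narrative event into an absolute timestamp, which is the intuition behind the calibration step.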

[NLP-5] MeMo: Memory as a Model DATE

【速读】: 该论文试图解决大型语言模型(LLMs)在预训练后参数冻结,无法高效整合实时、领域特定新知识的问题。解决方案的关键在于提出 MeMo(Memory as a Model)框架,该框架将新知识编码为专用的记忆模型(memory model),同时保持 LLM 参数不变。MeMo 的核心创新包括:通过记忆模型捕获复杂的跨文档关系、对检索噪声具有鲁棒性、避免 LLM 的灾难性遗忘(catastrophic forgetting)、无需访问 LLM 的权重或输出对数几率(logits),从而支持开放与闭源 LLM 的即插即用,并且其推理时的检索成本不随语料库规模增长。

Link: https://arxiv.org/abs/2605.15156
Authors: Ryan Wei Heng Quek, Sanghyuk Lee, Alfred Wei Lun Leong, Arun Verma, Alok Prakash, Nancy F. Chen, Bryan Kian Hsiang Low, Daniela Rus, Armando Solar-Lezama
Affiliations: Institute of Data Science, National University of Singapore; Integrative Sciences and Engineering Programme, NUSGS; Agency for Science, Technology and Research (A*STAR); Department of Computer Science, National University of Singapore; University of Tokyo; Liquid AI, USA; CSAIL, Massachusetts Institute of Technology, USA; AI Singapore; Singapore-MIT Alliance for Research and Technology Centre, Singapore
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper introduces MeMo, a framework that augments any LLM with up-to-date or domain-specific knowledge via a trained memory model, avoiding costly retraining, mitigating catastrophic forgetting, and remaining robust to retrieval noise

Click to view abstract

Abstract:Large language models (LLMs) achieve strong performance across a wide range of tasks, but remain frozen after pretraining until subsequent updates. Many real-world applications require timely, domain-specific information, motivating the need for efficient mechanisms to incorporate new knowledge. In this paper, we introduce MeMo (Memory as a Model), a modular framework that encodes new knowledge into a dedicated memory model while keeping the LLM parameters unchanged. Compared to existing methods, MeMo offers several advantages: (a) it captures complex cross-document relationships, (b) it is robust to retrieval noise, (c) it avoids catastrophic forgetting in the LLM, (d) it does not require access to the LLM's weights or output logits, enabling plug-and-play integration with both open and proprietary closed-source LLMs, and (e) its retrieval cost is independent of corpus size at inference time. Our experimental results on three benchmarks, BrowseComp-Plus, NarrativeQA, and MuSiQue, show that MeMo achieves strong performance compared to existing methods across diverse settings.

[NLP-6] Self-Distilled Agentic Reinforcement Learning

Quick Read: This paper addresses two problems in using reinforcement learning (RL) to post-train multi-turn LLM agents: trajectory-level reward signals provide only coarse supervision, and naively adding dense token-level guidance via On-Policy Self-Distillation (OPSD) destabilizes supervision through compounding multi-turn instability, while negative teacher rejections arising in skill-conditioned privileged guidance (possibly due to imperfect skill retrieval or utilization) require asymmetric treatment. The key to the solution is SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone: detached token-level signals are mapped into a sigmoid gate that strengthens distillation on teacher-endorsed positive-gap tokens and softly attenuates negative teacher rejections. This stably injects dense distillation signals while preserving RL's optimization strength, avoids the instability of naive RL+OPSD, and yields substantial gains across benchmarks.

Link: https://arxiv.org/abs/2605.15155
Authors: Zhengxi Lu, Zhiyuan Yao, Zhuowen Han, Zi-Han Wang, Jinyang Wu, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Affiliations: Zhejiang University; Meituan; Tsinghua University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:Reinforcement learning (RL) has emerged as a central paradigm for post-training LLM agents, yet its trajectory-level reward signal provides only coarse supervision for long-horizon interaction. On-Policy Self-Distillation (OPSD) complements RL by introducing dense token-level guidance from a teacher branch augmented with privileged context. However, transferring OPSD to multi-turn agents proves problematic: compounding multi-turn instability destabilizes supervision, while skill-conditioned privileged guidance requires asymmetric treatment for negative teacher rejections may arise from imperfect skills retrieval or utilization. We introduce SDAR (Self-Distilled Agentic Reinforcement Learning), which treats OPSD as a gated auxiliary objective while keeping RL as the primary optimization backbone. SDAR maps detached token-level signals into a sigmoid gate, strengthening distillation on teacher-endorsed positive-gap tokens and softly attenuating negative teacher rejections. Across the Qwen2.5 and Qwen3 families on ALFWorld, WebShop, and Search-QA, SDAR substantially improves over GRPO (+9.4% on ALFWorld, +7.0% on Search-QA, +10.2% on WebShop-Acc), avoids the instability of naive GRPO+OPSD, and consistently outperforms hybrid RL–OPSD baselines across model scales.
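The asymmetric gating described above can be sketched as a per-token weight derived from the detached teacher-student gap. The exact gap definition and the attenuation constant below are assumptions for illustration, not the paper's formula:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def gate_weight(gap, neg_scale=0.1):
    """Map a detached token-level teacher-student gap to a distillation weight:
    positive gaps (teacher-endorsed tokens) receive strong weight, while
    negative gaps (teacher rejections) are softly attenuated, not zeroed."""
    w = sigmoid(gap)
    return w if gap >= 0 else neg_scale * w
```

The asymmetry reflects the point made above: a negative teacher signal may stem from imperfect skill retrieval rather than a genuinely bad student token, so it should down-weight distillation rather than penalize hard.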

[NLP-7] Forgetting That Sticks: Quantization-Permanent Unlearning via Circuit Attribution

Quick Read: This paper addresses the failure of machine unlearning in language models under deployment-time post-training quantization, which manifests as two systematic failure modes: gradient-based methods achieve meaningful forgetting but almost fully recover it under compression (e.g., 4-bit quantization), while methods that survive compression barely change the model. The root cause is that in all baselines, per-parameter updates lie 47x to 828x below the NF4 quantization bin width, so updates never cross bin boundaries; the authors formalize this as a sparsity-permanence tradeoff. The key to the solution is MANSU (Mechanistic-Aligned Null-Space Unlearning), which isolates the minimal forget-set subgraph via causal circuit attribution, combines circuit-restricted null-space projection with a diagonal-Fisher retain bound, and introduces a per-parameter magnitude floor that guarantees quantization survival, thereby jointly achieving meaningful forgetting, retain-set preservation, a non-positive PTQ gap, and structural erasure. The paper additionally proposes Circuit Attribution Divergence (CAD) as a mechanistic verification metric that distinguishes structural erasure from behavioral suppression.

Link: https://arxiv.org/abs/2605.15138
Authors: Saisab Sadhu, Pratinav Seth, Vinay Kumar Sankarapu
Affiliations: Lexsi Labs
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
Comments:

Click to view abstract

Abstract:Standard unlearning evaluations measure behavioral suppression in full precision, immediately after training, despite every deployed language model being quantized first. Recent work has shown that 4-bit post-training quantization can reverse machine unlearning; we show this is not a tuning artefact but a systematic dual failure: gradient-based methods that achieve meaningful forgetting lose it under compression, while methods that survive quantization barely change the model. Both failures trace to the same root cause: across all baselines, per-parameter updates lie 47-828x below the NF4 quantization bin width; updates diffused across billions of parameters cannot clear quantization bin boundaries, a consequence we formalize as a sparsity-permanence tradeoff. We present MANSU (Mechanistic-Aligned Null-Space Unlearning), which resolves both modes by combining causal circuit attribution to isolate the minimal forget-set subgraph, circuit-restricted null-space projection with a diagonal-Fisher retain bound, and a per-parameter magnitude floor guaranteeing quantization survival by construction. We additionally introduce Circuit Attribution Divergence (CAD), a mechanistic verification metric distinguishing structural erasure from behavioral suppression, a distinction existing metrics cannot make. Across multiple model families and hazard benchmarks, MANSU is the first method to jointly satisfy all four properties with margin on each (meaningful forgetting, retain preservation, non-positive PTQ gap, and structural erasure), while gradient-based baselines recover up to +0.05 accuracy under compression.
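The bin-boundary argument above is simple arithmetic: an update smaller than half a quantization bin width rounds back to the same quantized value and is erased. A sketch with uniform bins (NF4 bins are actually non-uniform and normal-distribution-spaced; uniform spacing is a simplifying assumption here):

```python
def bin_width(lo, hi, bits=4):
    """Width of one quantization bin for a uniform `bits`-bit grid on [lo, hi]."""
    return (hi - lo) / (2 ** bits - 1)

def survives_quantization(update, lo=-1.0, hi=1.0, bits=4):
    """Worst case, an update crosses a bin boundary only if it exceeds half a
    bin width; smaller updates are erased by round-to-nearest quantization."""
    return abs(update) >= bin_width(lo, hi, bits) / 2

w = bin_width(-1.0, 1.0)  # 2/15 for a 4-bit grid on [-1, 1]
```

Under this model, an update lying 47x to 828x below the bin width (as measured for the baselines) cannot move any quantized weight, which is why MANSU enforces a per-parameter magnitude floor.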

[NLP-8] Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

【速读】: 该论文试图解决当前大型语言模型(LLM)攻击基准测试在覆盖威胁表面(threat surface)方面的系统性缺陷:现有基准测试缺乏统一的评估框架,导致对攻击面的覆盖零散、不完整,且存在严重的命名碎片化(naming fragmentation)和评估漏洞。解决方案的关键是构建一个基于STRIDE模型的可重用审计框架,该框架通过从932篇arXiv安全研究(2023–2026)中提取的507叶分类法(taxonomy)——其中401个叶节点有数据填充、106个叶节点源自威胁模型推导——形成4×6的目标×技术矩阵(Target × Technique matrix)。该矩阵支持基准外部验证(benchmark-external validation),即审计多个基准的集体覆盖而非单个基准的一致性。应用该框架发现,现有主流基准(如HarmBench、InjecAgent、AgentDojo)最多只覆盖矩阵的25%,且整个STRIDE威胁类别(如服务中断 Service Disruption、模型内部 Model Internals)缺乏标准化评估,尽管已发表的攻击在这些类别中实现了46倍令牌放大和96%的攻击成功率。该框架通过可扩展的分类法、攻击记录和覆盖映射,使后续新基准也能映射到同一矩阵,从而持续追踪评估空白是否被填补。

Link: https://arxiv.org/abs/2605.15118
Authors: Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets
Affiliations: Palo Alto Networks
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Click to view abstract

Abstract:We introduce a reusable framework for auditing whether LLM attack benchmarks collectively cover the threat surface: a 4x6 Target x Technique matrix grounded in STRIDE, constructed from a 507-leaf taxonomy – 401 data-populated and 106 threat-model-derived leaves – of inference-time attacks extracted from 932 arXiv security studies (2023–2026). The matrix enables benchmark-external validation – auditing collective coverage rather than individual benchmark consistency. Applying it to six public benchmarks reveals that the three primary frameworks (HarmBench, InjecAgent, AgentDojo) occupy non-overlapping cells covering at most 25% of the matrix, while entire STRIDE threat categories (Service Disruption, Model Internals) lack any standardized evaluation, despite published attacks in these categories achieving 46x token amplification and 96% attack success rates through mechanisms which no benchmark tests. The corpus of 2,521 unique attack groups further reveals pervasive naming fragmentation (up to 29 surface forms for a single attack) and heavy concentration in Safety & Alignment Bypass, structural properties invisible at smaller scale. The taxonomy, attack records, and coverage mappings are released as extensible artifacts; as new benchmarks emerge, they can be mapped onto the same matrix, enabling the community to track whether evaluation gaps are closing.
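Coverage auditing over such a matrix is a straightforward set computation; a minimal sketch with invented cell labels (the paper's actual target and technique names are only partially listed in this summary):

```python
# Cells of a hypothetical 4x6 Target x Technique matrix.
targets = ["model", "agent", "pipeline", "infra"]
techniques = ["T1", "T2", "T3", "T4", "T5", "T6"]
all_cells = {(t, q) for t in targets for q in techniques}

# Which cells each (hypothetical) benchmark exercises.
benchmarks = {
    "bench_a": {("model", "T1"), ("model", "T2")},
    "bench_b": {("agent", "T3"), ("pipeline", "T1")},
}

def collective_coverage(benchmarks, all_cells):
    """Fraction of the matrix covered by the union of all benchmarks."""
    covered = set().union(*benchmarks.values())
    return len(covered & all_cells) / len(all_cells)
```

The union is what makes this "benchmark-external" auditing: the question is whether the benchmarks jointly tile the matrix, not whether any single benchmark is internally consistent.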

[NLP-9] Proposal and study of statistical features for string similarity computation and classification

Quick Read: This paper pursues a purely statistical, language-independent approach to string similarity computation. The key to the solution is adapting the co-occurrence matrix (COM) and run-length matrix (RLM), commonly used in visual computing, to similarity computation over general strings (words, phrases, codes, and texts); because these features are statistical rather than linguistic, they apply to any language or grammatical structure. They outperform traditional statistical measures such as longest common subsequence and edit distance on both synthetic experiments and a real text-plagiarism dataset; in 3 out of 4 cases, the RLM and COM features were statistically far more significant than the second-best, distance-based group (P-value < 0.001).

Link: https://arxiv.org/abs/2605.15110
Authors: E.O. Rodrigues, D. Casanova, M. Teixeira, V. Pegorini, F. Favarim, E. Clua, A. Conci, Panos Liatsis
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value < 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
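A minimal illustration of character-level co-occurrence and run-length statistics for strings, in the spirit of the adaptation above (a generic sketch; the paper's exact matrix definitions, offsets, and normalization are not specified here):

```python
from collections import Counter

def cooccurrence(s, offset=1):
    """Count ordered character pairs (s[i], s[i+offset]) -- a string COM."""
    return Counter(zip(s, s[offset:]))

def run_lengths(s):
    """Count runs of identical consecutive characters by (char, run length)."""
    runs, i = Counter(), 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs[(s[i], j - i)] += 1
        i = j
    return runs

def similarity(a, b, feature=cooccurrence):
    """Histogram-intersection similarity over the chosen feature, in [0, 1]."""
    fa, fb = feature(a), feature(b)
    inter = sum((fa & fb).values())  # Counter & = elementwise min
    denom = max(sum(fa.values()), sum(fb.values()))
    return inter / denom if denom else 1.0
```

Nothing here depends on a dictionary, grammar, or tokenizer, which is the sense in which such features are language-independent.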

[NLP-10] From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Quick Read: This paper addresses the problem that existing tool-calling benchmarks are text-based and lack reliable voice counterparts: how can a verified text benchmark be converted into a controlled audio tool-calling evaluation without re-annotating the tool schema and gold labels? The key to the solution is a dataset-agnostic framework that uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations, enabling standardized diagnostics across omni-modal models. Combined with text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge evaluation protocol, it provides a verifiable and reproducible first-stage evaluation for building real-world voice tool calling.

Link: https://arxiv.org/abs/2605.15104
Authors: Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN
Affiliation: Dialpad Inc.
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

[NLP-11] Improving Multi-turn Dialogue Consistency with Self-Recall Thinking

Quick Read: This paper targets two problems in LLM-based multi-turn dialogue systems: the difficulty of tracking dependencies across non-adjacent turns degrades consistency and scalability, and as conversations grow, sparse key information is buried in irrelevant history while processing the full dialogue history incurs severe efficiency bottlenecks. The key to the solution is the Self-Recall Thinking (SRT) framework, built from three components: Dependency Construction, which identifies helpful historical turns and converts them into self-recall chains; Capability Initialization, which trains the model to generate reasoning chains with recall tokens; and Reasoning Improvement, which refines recall and reasoning accuracy with verifiable rewards. Without external modules, the model can selectively recall and reason over historical context at inference time and generate contextually appropriate responses, improving F1 by 4.7% while reducing end-to-end latency by 14.7% across multiple datasets.

Link: https://arxiv.org/abs/2605.15102
Authors: Renning Pang, Tian Lan, Leyuan Liu, Xiaoming Huang, Piao Tong, Xiaosong Zhang
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:Large language model (LLM) based multi-turn dialogue systems often struggle to track dependencies across non-adjacent turns, undermining both consistency and scalability. As conversations lengthen, essential information becomes sparse and is buried in irrelevant context, while processing the entire dialogue history incurs severe efficiency bottlenecks. Existing solutions either rely on high latency external memory or lose fine-grained details through iterative summarization. In this paper, we propose Self-Recall Thinking (SRT), a framework designed to address long-range contextual dependency and sparse informative signals in multi-turn dialogue. SRT identifies helpful historical turns and uses them to generate contextually appropriate responses, enabling the model to selectively recall and reason over context during inference. This process yields an endogenous reasoning process that integrates interpretable recall steps without external modules. SRT incorporates: (1) Dependency Construction: identifying helpful historical turns and converting them into self-recall chains; (2) Capability Initialization: training the model to generate reasoning chains with recall tokens; (3) Reasoning Improvement: refining accuracy via verifiable rewards to optimize recall and reasoning for correct answers. Experiments on multiple datasets demonstrate that SRT improves F1 score by 4.7% and reduces end-to-end latency by 14.7% over prior methods, achieving a balance between reasoning latency and accuracy, and outperforming state-of-the-art baselines.

[NLP-12] ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World ICML2026

Quick Read: This paper tackles three critical barriers in the development of high-quality text embeddings: prohibitive computational cost, a narrow linguistic focus on a few languages that neglects most of the world's languages, and the lack of transparency of closed-source or open-weight models, which stifles research. The core of the solution is the ML-Embed model suite, built on a new framework, 3-Dimensional Matryoshka Learning (3D-ML), which combines the storage benefits of Matryoshka Representation Learning (MRL), the flexible inference-time depth of Matryoshka Layer Learning (MLL), and the newly proposed Matryoshka Embedding Learning (MEL) for parameter efficiency, achieving comprehensive computational efficiency across the entire model lifecycle. To address linguistic diversity and transparency, the authors curate a massively multilingual dataset, train models from 140M to 8B parameters, and fully open-source all models, data, and code.

Link: https://arxiv.org/abs/2605.15081
Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by ICML 2026. The data has been released earlier in the preprint arXiv:2603.19223

View abstract

Abstract:The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world’s languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
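The Matryoshka idea behind MRL can be sketched in a few lines (a hypothetical toy, not the ML-Embed training objective): every prefix of the embedding is scored and trained as an embedding in its own right, so a single vector serves several storage budgets.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def matryoshka_scores(u, v, dims=(2, 4, 8)):
    """Score a pair at each nested prefix dimension: in MRL the first d
    coordinates of an embedding are themselves a usable embedding."""
    return {d: cosine(u[:d], v[:d]) for d in dims}

def matryoshka_loss(u, v, label, dims=(2, 4, 8)):
    """Aggregate per-dimension squared error against a target similarity;
    training on the sum makes every prefix informative."""
    return sum((s - label) ** 2 for s in matryoshka_scores(u, v, dims).values())
```

At inference time, a deployment can simply truncate stored vectors to the smallest prefix dimension whose score quality is acceptable.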

[NLP-13] Concurrency without Model Changes: Future-based Asynchronous Function Calling for LLMs

Quick Read: This paper addresses the increased end-to-end latency caused by the synchronous execution semantics that constrain function calling in current LLM agents. The key to the solution is AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, allowing decoding and function execution to overlap when dependencies permit and enabling inter-function parallelism. The framework layers directly over existing models and unmodified function implementations without fine-tuning or changes to the standard synchronous function-calling protocol, significantly reducing end-to-end task completion time while preserving task accuracy.

Link: https://arxiv.org/abs/2605.15077
Authors: Guangyu Feng, Huanzhi Mao, Prabal Dutta, Joseph E. Gonzalez
Affiliation: University of California, Berkeley
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

View abstract

Abstract:Function calling, also known as tool use, is a core capability of modern LLM agents but is typically constrained by synchronous execution semantics. Under these semantics, LLM decoding is blocked until each function call completes, resulting in increasing end-to-end latency. In this work, we introduce AsyncFC, a pure execution-layer framework that decouples LLM decoding from function execution, enabling overlap between model decoding and function execution as well as inter-function parallelism when dependencies permit. AsyncFC layers over existing models and unmodified function implementations, requiring no fine-tuning or changes to the standard synchronous function-calling protocol. Across standard function-calling benchmarks and adapted software engineering benchmarks, AsyncFC significantly reduces end-to-end task completion time while preserving task accuracy. Furthermore, these results reveal that LLMs possess a native capability to reason over symbolic futures that represent unresolved execution results, enabling an asynchronous paradigm for model-tool interaction.
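The future-based pattern AsyncFC exploits can be sketched with plain asyncio (an illustrative analogy, not the AsyncFC implementation; tool names and delays are invented): independent tool calls are dispatched immediately as tasks and awaited only when their results are needed, so their latencies overlap instead of adding up.

```python
import asyncio

async def slow_tool(name, delay, result):
    # Stand-in for a real tool/function call with network or compute latency.
    await asyncio.sleep(delay)
    return f"{name}:{result}"

async def agent_turn():
    # Dispatch both independent calls immediately as futures, rather than
    # blocking "decoding" on each call in sequence (synchronous semantics).
    weather = asyncio.create_task(slow_tool("weather", 0.05, "sunny"))
    stock = asyncio.create_task(slow_tool("stock", 0.05, "up"))
    # ...decoding could continue here; results are awaited only when needed.
    return await asyncio.gather(weather, stock)

results = asyncio.run(agent_turn())
```

With two 50 ms tools, the sequential protocol pays ~100 ms while the future-based version pays ~50 ms, which is the source of the end-to-end speedup.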

[NLP-14] On the Cultural Anachronism and Temporal Reasoning in Vision Language Models

Quick Read: This paper addresses the pervasive problem of cultural anachronism in how vision-language models (VLMs) interpret historical artifacts, i.e., the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks, a deficiency that is especially pronounced for artifacts from non-Western visual cultures underrepresented in training data. The key to the solution is a dedicated benchmark for evaluating VLM temporal reasoning, the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), comprising 600 questions across six categories over 1,600 Indian cultural artifacts spanning prehistoric to modern periods. A systematic evaluation of ten state-of-the-art models shows that even the best model (GPT-5.2) achieves only 58.7% overall accuracy, confirming cultural anachronism as a fundamental limitation of current VLMs; the benchmark provides a foundation for improving the temporal cognition of multimodal AI systems on historical artifacts.

Link: https://arxiv.org/abs/2605.15071
Authors: Mukul Ranjan, Prince Jha, Khushboo Kumari, Zhiqiang Shen
Affiliation: MBZUAI; Inception
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Project Page: this https URL

View abstract

Abstract:Vision-Language Models (VLMs) are increasingly applied to cultural heritage materials, from digital archives to educational platforms. This work identifies a fundamental issue in how these models interpret historical artifacts. We define this phenomenon as cultural anachronism, the tendency to misinterpret historical objects using temporally inappropriate concepts, materials, or cultural frameworks. To quantify this phenomenon, we introduce the Temporal Anachronism Benchmark for Vision-Language Models (TAB-VLM), a dataset of 600 questions across six categories, designed to evaluate temporal reasoning on 1,600 Indian cultural artifacts spanning prehistoric to modern periods. Systematic evaluations of ten state-of-the-art models reveal significant deficiencies on our benchmark, and even the best model (GPT-5.2) achieves only 58.7% overall accuracy. The performance gap persists across varying architectures and scales, suggesting that cultural anachronism represents a significant limitation in visual AI systems, regardless of model size. These findings highlight the disparity between current VLM capabilities and the requirements for accurately interpreting cultural heritage materials, particularly for non-Western visual cultures underrepresented in training data. Our benchmark provides a foundation for enhancing temporal cognition in multimodal AI systems that interact with historical artifacts. The dataset and code are available in our project page.

[NLP-15] Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Quick Read: This paper addresses a core problem in LLM tool use: balancing appropriate reasoning depth with strict structural validity without reusing raw exemplar outputs. The key to the solution is CAST (Case-driven Adaptive Strategy for Tool use), a case-driven framework that treats historical execution trajectories as structured cases and extracts two kinds of signals from them: complexity profiles for estimating optimal reasoning strategies, and failure profiles for mapping likely structural breakdowns. These signals are translated into a fine-grained reward design and an adaptive reasoning mechanism, enabling the model to autonomously internalize case-based strategies through reinforcement learning, reducing unnecessary deliberation while improving schema-faithful execution and task-level tool-use success.

Link: https://arxiv.org/abs/2605.15041
Authors: Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, Xiaosong Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases. Instead of reusing raw exemplar outputs, CAST extracts case-derived signals to identify complexity profiles for estimating optimal reasoning strategies, alongside failure profiles to map likely structural breakdowns. The framework translates this knowledge into a fine-grained reward design and adaptive reasoning, enabling the model to autonomously internalize case-based strategies during reinforcement learning. Experiments on BFCLv2 and ToolBench demonstrate that CAST improves both schema-faithful execution and task-level tool-use success while reducing unnecessary deliberation. The approach achieves up to 5.85 percentage points gain in overall execution accuracy and reduces average reasoning length by 26%, significantly mitigating high-impact structural errors. Ultimately, this demonstrates how historical execution cases can provide reusable adaptation knowledge for calibrated tool use.

[NLP-16] Orchard: An Open-Source Agentic Modeling Framework

Quick Read: This paper addresses the infrastructure and training gaps that constrain open research on agentic modeling: most high-performing systems rely on proprietary codebases, models, or services, while open-source frameworks focus on orchestration and evaluation and lack support for scalable agent training. The key to the solution is Orchard, an open-source framework whose core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, the paper builds three agentic modeling recipes: Orchard-SWE combines distilled trajectories and credit-assignment SFT with Balanced Adaptive Rollout for reinforcement learning, achieving state-of-the-art results among open-source models on SWE-bench Verified; Orchard-GUI trains a vision-language computer-use agent from very few distilled trajectories and open-ended tasks, reaching the strongest open-source results on several benchmarks while remaining competitive with proprietary systems; and Orchard-Claw trains a personal assistant agent with only a small number of synthetic tasks, achieving high pass rates. Together these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluation across domains.

Link: https://arxiv.org/abs/2605.15040
Authors: Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandrio Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao
Affiliation: Microsoft Research; Columbia University; UIUC
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

[NLP-17] From Scenes to Elements: Multi-Granularity Evidence Retrieval for Verifiable Multimodal RAG

Quick Read: This paper addresses the granularity mismatch in multimodal retrieval-augmented generation (RAG) systems between coarse-grained evidence retrieval (entire images or scenes) and fine-grained user queries, which makes system failures hard to trace and verify. The key to the solution is the GranuRAG framework, which treats visual elements as first-class retrieval units through a three-stage pipeline: element-level detection and classification, which decomposes visual scenes into independently retrievable entities; multi-granularity cross-modal alignment, which enables element-level evidence retrieval; and attribution-constrained generation, which requires generation to explicitly attribute the retrieved elements. The core innovation is lowering retrieval granularity from coarse scenes to individual elements, replacing retrieval via implicit attention and enabling transparent, verifiable error diagnosis.

Link: https://arxiv.org/abs/2605.15019
Authors: Guanhua Chen, Chuyue Huang, Yutong Yao, Shudong Liu, Xueqing Song, Lidia S. Chao, Derek F. Wong
Affiliation: NLP2CT Lab, Department of Computer and Information Science, University of Macau
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Multimodal Retrieval-Augmented Generation (RAG) systems retrieve evidence at coarse granularities (entire images or scenes), creating a mismatch with fine-grained user queries and making failures unverifiable. We introduce GranuVistaVQA, a multimodal benchmark featuring real-world landmarks with element-level annotations across multiple viewpoints, capturing the partial observation challenge where individual images contain only subsets of entities. We further propose GranuRAG, a multi-granularity framework that treats visual elements as first-class retrieval units through three stages: element-level detection and classification, multi-granularity cross-modal alignment for evidence retrieval, and attribution-constrained generation. By grounding retrieval at the element level rather than relying on implicit attention, our approach enables transparent error diagnosis. Experiments demonstrate that GranuRAG achieves up to 29.2% improvement over six strong baselines for this task.

[NLP-18] COTCAgent: Preventive Consultation via Probabilistic Chain-of-Thought Completion

Quick Read: This paper addresses two key deficiencies of current large language models in longitudinal electronic health record (EHR) reasoning: first, lacking fine-grained statistical reasoning, models tend to hallucinate clinical trends and metrics when quantitative evidence is only textually implied, biasing diagnostic inference; second, non-uniform time series and scarce labels make it hard to capture long-range temporal dependencies, limiting the reliability of clinical reasoning. The key to the solution is the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework built from three decoupled modules: the Temporal-Statistics Adapter (TSA), which converts analytical plans into executable code for standardized trend output; the Chain-of-Thought Completion (COTC) layer, which evaluates disease risk using a symptom-trend-disease knowledge base with weighted scoring; and a bounded completion module, which acquires structured evidence through standardized inquiries and iterative scoring constraints. By decoupling statistical computation, feature matching, and language generation, the framework analyzes longitudinal records efficiently without complex multimodal inputs and at lower computational overhead, substantially improving reasoning accuracy and reliability.

Link: https://arxiv.org/abs/2605.15016
Authors: Zihan Deng, Xiaozhen Zhong, Chuanzhi Xu
Affiliation: School of Computing and Data Science, The University of Hong Kong; Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; School of Computer Science, The University of Sydney
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

View abstract

Abstract:As large language models empower healthcare, intelligent clinical decision support has developed rapidly. Longitudinal electronic health records (EHR) provide essential temporal evidence for accurate clinical diagnosis and analysis. However, current large language models have critical flaws in longitudinal EHR reasoning. First, lacking fine-grained statistical reasoning, they often hallucinate clinical trends and metrics when quantitative evidence is textually implied, biasing diagnostic inference. Second, non-uniform time series and scarce labels in longitudinal EHR hinder models from capturing long-range temporal dependencies, limiting reliable clinical reasoning. To address the above limitations, this work presents the Probabilistic Chain-of-Thought Completion Agent (COTCAgent), a hierarchical reasoning framework for longitudinal electronic health records. It consists of three core modules. The Temporal-Statistics Adapter (TSA) converts analytical plans into executable code for standardized trend output. The Chain-of-Thought Completion (COTC) layer leverages a symptom-trend-disease knowledge base with weighted scoring to evaluate disease risk, while the bounded completion module acquires structured evidence through standardized inquiries and iterative scoring constraints to ensure rigorous reasoning. By decoupling statistical computation, feature matching, and language generation, the framework eliminates reliance on complex multi-modal inputs and enables efficient longitudinal record analysis with lower computational overhead. Experimental results show that COTCAgent powered by Baichuan-M2 achieves 90.47% Top-1 accuracy on the self-built dataset and 70.41% on HealthBench, outperforming existing medical agents and mainstream large language models. The code is available at this https URL.

[NLP-19] Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Quick Read: This paper addresses the low sample efficiency of Reinforcement Learning with Verifiable Rewards (RLVR) on difficult problems, such as math and coding tasks requiring chain-of-thought rollouts, where correct trajectories are hard to generate. Prior work tackles this with demonstration-guided RLVR, applying supervised fine-tuning (SFT) when RL fails, but SFT typically requires large amounts of data that are expensive to acquire. The proposed FEST (FEw-ShoT demonstration-guided RLVR) algorithm matches full-dataset performance using only 128 demonstrations randomly selected from an SFT dataset. The key to the solution lies in the interplay of three components: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting during multi-epoch training.

Link: https://arxiv.org/abs/2605.15012
Authors: Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
Affiliation: University of Illinois Urbana-Champaign
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 25 pages, 11 figures

View abstract

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised FineTuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for the success: supervised signal, on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.
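The decaying-weight component can be sketched as follows; the linear schedule and all function names are my own assumptions, since the paper only states that the few-shot SFT weight decays to prevent multi-epoch overfitting:

```python
def sft_weight(step, total_steps, w0=1.0, w_min=0.0):
    """Linearly decay the few-shot SFT loss weight over training, so the
    supervised signal dominates early and the on-policy RL signal later."""
    frac = min(step / max(total_steps, 1), 1.0)
    return w_min + (w0 - w_min) * (1.0 - frac)

def combined_loss(rl_loss, sft_loss, step, total_steps):
    """Total objective: on-policy RL loss plus the decayed SFT term."""
    return rl_loss + sft_weight(step, total_steps) * sft_loss
```

Other decay shapes (exponential, cosine) fit the same slot; the point is only that the 128-demonstration SFT term fades rather than being reused at full weight every epoch.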

[NLP-20] The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

Quick Read: This paper addresses the automated extraction and prediction of relationships between scientific contributions for technological roadmapping: extracting scientific contributions from scholarly articles, linking them to their prerequisites, and thereby predicting future scientific discoveries. The key to the solution is a large-scale structured resource, the Scientific Contribution Graph, containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges, together with a new scientific prerequisite prediction task evaluated via temporally filtered backtesting; contemporary models are improving rapidly on this task, reaching 0.48 MAP. The resource and methodology lay a foundation for scientific impact assessment and automated scientific discovery.

Link: https://arxiv.org/abs/2605.15011
Authors: Peter A. Jansen
Affiliation: University of Arizona; Allen Institute for Artificial Intelligence
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 4 figures

View abstract

Abstract:Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.

[NLP-21] Quantifying and Mitigating Premature Closure in Frontier LLMs

Quick Read: This paper addresses diagnostic errors in medical tasks caused by premature closure in large language models (LLMs), i.e., inappropriately committing to an answer, recommendation, or clinical guidance when information is insufficient, rather than taking the safer course of clarification, abstention, escalation, or refusal. The key finding is that safety-oriented prompting reduces inappropriate response rates but cannot eliminate failures entirely, so the paper argues for further evaluating and designing mechanisms that let medical LLMs explicitly recognize when not to answer.

Link: https://arxiv.org/abs/2605.15000
Authors: Rebecca Handler, Suhana Bedi, Nigam Shah
Affiliation: Department of Medicine, Stanford University; Department of Biomedical Data Science, Stanford University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, 3 figures, 1 table

View abstract

Abstract:Premature closure, or committing to a conclusion before sufficient information is available, is a recognized contributor to diagnostic error but remains underexamined in large language models (LLMs). We define LLM premature closure as inappropriate commitment under uncertainty: providing an answer, recommendation, or clinical guidance when the safer response would be clarification, abstention, escalation, or refusal. We evaluated five frontier LLMs across structured and open-ended medical tasks. In MedQA (n = 500) and AfriMed-QA (n = 490) questions where the correct choice had been removed, models still selected an answer at high rates, with baseline false-action rates of 55-81% and 53-82%, respectively. In open-ended evaluation, models gave inappropriate answers on an average of 30% of 861 HealthBench questions and 78% of 191 physician-authored adversarial queries. Safety-oriented prompting reduced premature closure across models, but residual failure persisted, highlighting the need to evaluate whether medical LLMs know when not to answer.

[NLP-22] Explainable Detection of Depression Status Shifts from User Digital Traces

Quick Read: The core problem this paper addresses is how to automatically detect and interpretably analyze depression-related status shifts (e.g., improvement, deterioration, or stability) from timestamped user-generated digital traces such as social media posts and chats, giving researchers and decision makers a view of mental-health signals evolving over time, without aiming at clinical diagnosis. The key to the solution is a staged explainable framework: first, multiple BERT-based models extract complementary signals along dimensions such as sentiment, emotion, and depression severity; these signals are then aggregated over time into user-level trajectories, and meaningful change points are identified via temporal modeling and segmentation; finally, an integrated large language model (LLM) generates concise, human-readable reports that describe the evolution of mental-health signals and highlight key transitions, enhancing interpretability. The novelty lies in combining multi-dimensional signal extraction, time-series change-point detection, and natural language generation, which outperforms direct LLM-based reporting in coverage of user history, temporal coherence, and sensitivity to change points.

Link: https://arxiv.org/abs/2605.14995
Authors: Loris Belcastro, Francesco Gervino, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
Affiliation: DIMES, University of Calabria
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
Comments:

View abstract

Abstract:Every day, users generate digital traces (e.g., social media posts, chats, and online interactions) that are inherently timestamped and may reflect aspects of their mental state. These traces can be organized into temporal trajectories that capture how a user’s mental health signals evolve, including phases of improvement, deterioration, or stability. In this work, we propose an explainable framework for detecting and analyzing depression-related status shifts in user digital traces. The approach combines multiple BERT-based models to extract complementary signals across different dimensions (e.g., sentiment, emotion, and depression severity). Such signals are then aggregated over time to construct user-level trajectories that are analyzed to identify meaningful change points. To enhance interpretability, the framework integrates a large language model to generate concise and human-readable reports that describe the evolution of mental-health signals and highlight key transitions. We evaluate the framework on two social media datasets. Results show that the approach produces more coherent and informative summaries than direct LLM-based reporting, achieving higher coverage of user history, stronger temporal coherence, and improved sensitivity to change points. An ablation study confirms the contribution of each component, particularly temporal modeling and segmentation. Overall, the method provides an interpretable view of mental health signals over time, supporting research and decision making without aiming at clinical diagnosis.
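The change-point step can be sketched with a simple sliding-window mean-shift detector (an illustrative stand-in with invented parameters; the paper's temporal modeling and segmentation method may differ):

```python
def change_points(scores, window=3, threshold=0.5):
    """Flag indices where the mean of the next `window` scores differs from
    the mean of the previous `window` by more than `threshold`."""
    cps = []
    for i in range(window, len(scores) - window + 1):
        before = sum(scores[i - window:i]) / window
        after = sum(scores[i:i + window]) / window
        if abs(after - before) > threshold:
            cps.append(i)
    return cps

# e.g. aggregated per-week severity scores from the BERT-based extractors
trajectory = [0.1, 0.2, 0.1, 0.9, 1.0, 0.8]
cps = change_points(trajectory)  # a sharp deterioration around index 3
```

Each flagged index marks a candidate status shift that the report-generation LLM can then describe in context.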

[NLP-23] Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing

Quick Read: This paper addresses the efficiency bottleneck in speculative decoding caused by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. The key to the solution is PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts draft-model optimization from conventional token-level imitation to window-level optimization, combining a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing that focuses training on informative windows with high confidence-weighted draft-target divergence.

Link: https://arxiv.org/abs/2605.14978
Authors: Jie Jiang, Xing Sun
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Speculative decoding accelerates LLM inference by having a lightweight draft model propose speculative windows of candidate tokens for parallel verification by a larger target model. In practice, speculative efficiency is often bottlenecked by hard-to-draft positions, where an early mismatch truncates the accepted prefix and invalidates the rest of the speculative window. Most learning-based drafters are still optimized with token-level supervised objectives, even though speculative utility is inherently window-level and prefix-sensitive. We propose PPOW (Performance-Driven Policy Optimization with Adaptive Windowing), a reinforcement learning framework that shifts drafter optimization from token-level imitation to window-level optimization. PPOW combines a Cost-Aware Speedup Reward, a Distribution-Based Proximity Reward, and Adaptive Divergence-Aware Windowing, which prioritizes informative windows with high confidence-weighted draft-target divergence. PPOW achieves average acceptance lengths of 6.29-6.52 and speedups of 3.39-4.36 \times across multiple model families and benchmarks under a unified decoding protocol. These results show that performance-driven window-level optimization is a practical approach to improving speculative decoding efficiency.
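The window-level quantity PPOW optimizes can be illustrated with a toy verification sketch (my own simplification, assuming greedy verification): the accepted length of a speculative window is the longest draft prefix that matches the target, and the average accepted length across windows is what drives the speedup.

```python
def accepted_prefix(draft, target):
    """Greedy-verification view of speculative decoding: the target model
    accepts draft tokens up to the first mismatch."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n

def mean_acceptance(windows):
    """Average accepted length over (draft, target) windows; longer accepted
    prefixes mean fewer target-model forward passes per generated token."""
    lengths = [accepted_prefix(d, t) for d, t in windows]
    return sum(lengths) / len(lengths)
```

This makes the paper's point concrete: one hard-to-draft position early in a window caps the accepted length at that position, regardless of how good the rest of the draft is.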

[NLP-24] Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Quick Read: This paper addresses the limitations of vision-language models (VLMs) on visual procedure question answering (VP-QA), specifically two key problems: inadequate cross-modal retrieval of structured procedures given a visual state, and misalignment between the granularity of image sequences and the semantic decomposition of textual steps. The key to the solution is Chain-of-Procedure (CoP), a hierarchical reasoning framework with a three-step pipeline: it first retrieves relevant procedural instructions using visual cues, then refines steps through semantic decomposition, and finally generates the next action. Experiments show that CoP achieves up to a 13% absolute improvement over standard baselines across six VLMs.

Link: https://arxiv.org/abs/2605.14928
Authors: Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong
Affiliation: NLP2CT Lab, Department of Computer and Information Science, University of Macau; Department of Electrical and Computer Engineering, University of Macau; Centre for Cognitive and Brain Sciences, University of Macau
Subjects: Computation and Language (cs.CL)
Comments:

View abstract

Abstract:Recent advances in vision-language models (VLMs) have achieved impressive results on standard image-text tasks, yet their potential for visual procedure question answering (VP-QA) remains largely unexplored. VP-QA presents unique challenges where users query next-step actions by uploading images for intermediate states of complex procedures. To systematically evaluate VLMs on this practical task, we propose ProcedureVQA, a novel multimodal benchmark specifically designed for visual procedural reasoning. Through comprehensive analysis, we identify two critical limitations in current VLMs: inadequate cross-modal retrieval of structured procedures given visual states, and misalignment between image sequence granularity and textual step decomposition. To address these issues, we present Chain-of-Procedure (CoP), a hierarchical reasoning framework that first retrieves relevant instructions using visual cues, then performs step refinement through semantic decomposition, and finally generates the next step. Experiments across six VLMs demonstrate CoP’s effectiveness, achieving up to 13% absolute improvement over standard baselines.

[NLP-25] Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

【速读】: 该论文旨在解决乌克兰法律文本领域中,不同基础模型(foundation models)的tokenizer效率差异巨大且缺乏系统性比较的问题,同时评估这些模型在零样本(zero-shot)和少样本(few-shot)场景下的实际性能表现。解决方案的关键在于:在统一数据集上计算每个模型的tokenizer fertility(即每单位文本生成的token数量),并结合三项任务的零样本性能进行综合基准测试,从而揭示tokenizer效率对API成本的直接影响(Qwen3模型在相同输入上比Llama家族多消耗60%的token),并发现少样本提示(few-shot prompting)在形态丰富的乌克兰语中反而会导致性能下降高达26个百分点。因此,研究建议从业者将tokenizer分析置于模型选择之前,并将零样本作为更可靠的默认策略。

链接: https://arxiv.org/abs/2605.14890
作者: Volodymyr Ovcharov
机构: LEX AI Platform, legal.org.ua (LEX AI平台, legal.org.ua)
类目: Computation and Language (cs.CL)
备注: 22 pages, 21 tables, 3 figures

点击查看摘要

Abstract:Foundation models tokenize Ukrainian legal text with vastly different efficiency, yet no systematic comparison exists for this domain. We benchmark seven models from five providers on 273 validated court decisions from Ukraine’s state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Three findings emerge. (1) Tokenizer fertility varies 1.6x: Qwen3 models consume 60% more tokens than Llama-family models on identical input, directly reducing API cost. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (675B total, 41B active) – a model with 5.6x more total parameters and 3.4x more active parameters per token – at one-third the API cost. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages.
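tokenizer fertility(平均每个词产生的 token 数)的计算本身很直接,下面给出一个最小化草图;其中的玩具分词器仅作演示,实际测量时应替换为各模型的真实 tokenizer(例如通过 Hugging Face 的 `AutoTokenizer` 加载,此处为假设用法)。

```python
def fertility(tokenize, texts):
    """tokenizer fertility:平均每个空白分隔的词产生多少 token。"""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# 玩具子词分词器:把每个词切成至多 3 个字符的片段
def toy_tokenize(text):
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

f = fertility(toy_tokenize, ["tokenizer fertility varies across models"])
print(f)  # → 2.4
```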

[NLP-26] Holistic Evaluation and Failure Diagnosis of AI Agents

【速读】: 该论文试图解决AI智能体(AI agent)评估中存在的根本性缺陷:当前的结果指标仅报告成功或失败,无法解释失败原因;而过程级评估方法又难以将失败类型与其在长结构化轨迹中的精确位置关联起来。解决方案的关键在于提出一个结合自上而下智能体级诊断与自下而上跨度级评估(span-level evaluation)的整体框架,将分析解耦为独立于每个跨度(span)的评估。这种分解使得评估可扩展到任意长度的轨迹,并为每个判断生成跨度级理由(span-level rationales)。实验证明,相同的前沿模型在该框架内使用时,定位精度可比作为全局审判器(monolithic judge)处理完整轨迹时高出数倍,表明评估方法论而非模型能力才是当前瓶颈。

链接: https://arxiv.org/abs/2605.14865
作者: Netta Madvil,Gilad Dym,Alon Mecilati,Edo Dekel,Jonatan Liberman,Rotem Brazilay,Liron Schliesser,Max Svidlo,Shai Nir,Orel Shalom,Yaron Friedman,David Connack,Amos Rimon,Philip Tannor,Shir Chorev
机构: Deepchecks(深度检查)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:AI agents execute complex multi-step processes, but current evaluation falls short: outcome metrics report success or failure without explaining why, and process-level approaches struggle to connect failure types to their precise locations within long, structured traces. We present a holistic agent evaluation framework that pairs top-down agent-level diagnosis with bottom-up span-level evaluation, decomposing analysis into independent per-span assessments. This decomposition scales to traces of arbitrary length and produces span-level rationales for each verdict. On the TRAIL benchmark, our framework achieves state-of-the-art results across all metrics on both GAIA and SWE-Bench, with relative gains over the strongest prior baselines of up to 38% on category F1, up to 3.5x on localization accuracy, and up to 12.5x on joint localization-categorization accuracy. Per-category analysis shows our framework leading in more error categories than any other evaluator. Notably, the same frontier model achieves several times higher localization accuracy when used inside our framework than as a monolithic judge over the full trace, showing that evaluation methodology, not model capability, is the bottleneck.
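摘要中"将分析分解为独立的逐跨度评估"的思路可以用如下草图说明:每个 span 独立送入评审模型并附带理由,成本随轨迹长度线性增长。`judge_span` 是假设的评审接口,并非论文实现。

```python
def evaluate_trace(spans, judge_span):
    """跨度级分解评估示意:每个 span 独立判定、互不依赖,
    因此可扩展到任意长度的轨迹,且每条判定自带理由。"""
    return [{"span": i, **judge_span(span)} for i, span in enumerate(spans)]

verdicts = evaluate_trace(
    ["call_api(x)", "parse(result)", "retry(None)"],
    judge_span=lambda s: {"ok": "None" not in s, "rationale": f"checked {s}"},
)
print([v["ok"] for v in verdicts])  # → [True, True, False]
```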

[NLP-27] Conversion of Lexicon-Grammar tables to LMF. Application to French

【速读】: 该论文试图解决将法语动词的词汇语法表(Lexicon-Grammar tables)转换为词汇标记框架(Lexical Markup Framework, LMF)格式的问题,以克服原有资源在互操作性方面的局限性,使其能够适用于不同的自然语言处理应用场景。解决方案的关键在于:首先简要介绍词汇语法及其衍生词典,进而系统分析转换过程中遇到的主要困难(如数据结构差异、语法信息映射等),最终生成符合LMF标准且具备良好可复用性的资源。

链接: https://arxiv.org/abs/2605.14816
作者: Eric Laporte,Elsa Tolone,Mathieu Constant
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.
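LMF(ISO 24613)核心模型用 LexicalEntry/Lemma/feat 等元素表示词条。下面用一个最小草图示意把一条词汇语法表记录包装为 LMF 风格的 XML;特征名与取值仅为演示,并非论文采用的完整转换模式。

```python
import xml.etree.ElementTree as ET

def to_lmf_entry(lemma, features):
    """把一条词汇语法表记录包装为 LMF LexicalEntry(示意)。"""
    entry = ET.Element("LexicalEntry")
    lem = ET.SubElement(entry, "Lemma")
    ET.SubElement(lem, "feat", att="writtenForm", val=lemma)
    for att, val in features.items():
        ET.SubElement(entry, "feat", att=att, val=val)
    return ET.tostring(entry, encoding="unicode")

xml = to_lmf_entry("manger", {"partOfSpeech": "verb"})
print(xml)
```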

[NLP-28] Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

【速读】: 该论文试图解决现有大语言模型(LLM)在自动生成研究思路时未能利用参考文献之间结构关系的问题,即当前方法仅依赖静态检索或复杂提示工程,忽略了引用网络中的关联拓扑。解决方案之关键在于提出Graphs of Research (GoR),一种监督微调方法,其核心是从每篇种子论文中提取2跳引用邻域,基于引用位置、频率、前驱链接和发表时间推导参考文献之间的关系,并将其组织成论文演化有向无环图(DAG),随后将包含该引用图、边信号、参考信息及任务定义的结构化文本提示用于微调Qwen2.5-7B-Instruct-1M,使模型能够预测种子论文的研究思路。通过引用演化图作为监督信号,GoR-SFT在头对头LLM评判锦标赛中达到最优性能,验证了图结构信息对增强LLM思路生成能力的有效性。

链接: https://arxiv.org/abs/2605.14790
作者: Songyang Gao,Yinghui Xia,Siyi Liu,Hui Xiong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Research idea generation is the innovation-driving step of automated scientific research. Recently, large language models (LLMs) have shown potential for automating idea generation at scale. However, existing methods mainly condition LLMs on eliciting idea generation through static retrieval of relevant literature or complex prompt engineering, without discarding the structural relations among references. We propose Graphs of Research (GoR), a supervised fine-tuning method that extracts a 2-hop reference neighborhood for each seed paper, derives the relations among those references from citation position, frequency, predecessor links, and publication time, and organizes them into a paper-evolution directed acyclic graph (DAG). We construct an automated extraction pipeline that draws data from five major ML/NLP venues, comprising 498/50/50 train/validation/test seed papers and approximately 7,600 cited references. Qwen2.5-7B-Instruct-1M is fine-tuned on a structured-text prompt that includes the citation graph, edge signals, reference information, and task definition to predict the idea for the seed paper. Across head-to-head LLM-judge tournaments against gpt-4o-driven baselines, GoR-SFT achieves SOTA, demonstrating the effectiveness of citation-evolution graphs as supervision signal for LLM-based idea generation. We hope that this reduces the barrier for citation evolution graphs as a supervision, accelerating automated scientific innovation.
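论文把参考文献间的关系组织为论文演化 DAG 并序列化进 SFT 提示。下面的草图示意邻接表构建与一种最朴素的文本序列化;边上的信号字段(如 freq)为假设示例,真实流水线还包含引用位置、前驱链接与发表时间等信号。

```python
from collections import defaultdict

def build_evolution_dag(edges):
    """把 (前驱论文, 后继论文, 边信号) 三元组组织为邻接表形式的 DAG(示意)。"""
    dag = defaultdict(list)
    for src, dst, signal in edges:
        dag[src].append((dst, signal))
    return dict(dag)

def serialize_for_prompt(dag):
    """最朴素的文本序列化,供拼接进结构化提示(示意)。"""
    return "\n".join(
        f"{src} -> {dst} ({sig})"
        for src in sorted(dag) for dst, sig in dag[src]
    )

dag = build_evolution_dag([
    ("P1", "P2", "freq=3"),
    ("P1", "P3", "freq=1"),
    ("P2", "P3", "freq=2"),
])
print(serialize_for_prompt(dag))
```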

[NLP-29] Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

【速读】: 该论文旨在揭示并纠正合成图像检索(Composed Image Retrieval, CIR)基准测试中存在的单模态捷径(unimodal shortcuts)问题:即大量查询可通过仅依赖参考图像或仅依赖文本修改单模态信息(而非真正多模态组合)解决,导致模型的多模态组合能力被高估。解决方案的关键在于实施两阶段审计:第一阶段通过跨模型分析识别出可被捷径解算的查询;第二阶段对剩余查询进行人工验证,筛除存在歧义编辑或目标不匹配等噪声的查询,最终构建一个干净的无捷径查询子集。在该子集上重新评估模型,才能真实反映其多模态组合能力,避免基准测试因混入捷径可解查询和噪声查询而过度乐观。

链接: https://arxiv.org/abs/2605.14787
作者: Matteo Attimonelli,Alessandro De Bellis,Aryo Pradipta Gema,Rohit Saxena,Monica Sekoyan,Wai-Chung Kwan,Claudio Pomo,Alessandro Suglia,Dietmar Jannach,Tommaso Di Noia,Pasquale Minervini
机构: Politecnico di Bari (巴里理工大学); Sapienza University of Rome (罗马萨皮恩扎大学); University of Edinburgh (爱丁堡大学); University of Klagenfurt (克拉根福大学); Miniml.AI (Miniml.AI)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
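审计第一阶段的"跨模型捷径识别"逻辑可以抽象为如下草图:只要任一模型仅凭单一模态即可检索到目标,该查询就记为 shortcut-solvable。数据结构与字段名均为假设,仅示意判定规则。

```python
def audit_queries(results):
    """捷径审计示意:results[qid] 为 模态 -> 各模型命中布尔列表。"""
    shortcut, compositional = [], []
    for qid, by_mod in results.items():
        solvable = any(by_mod.get("image_only", [])) or any(by_mod.get("text_only", []))
        (shortcut if solvable else compositional).append(qid)
    return shortcut, compositional

s, c = audit_queries({
    "q1": {"image_only": [True, False], "text_only": [False, False]},
    "q2": {"image_only": [False, False], "text_only": [False, False]},
})
print(s, c)  # → ['q1'] ['q2']
```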

[NLP-30] Streaming Speech-to-Text Translation with a SpeechLLM

【速读】: 该论文试图解决现有语音到文本翻译系统中,基于语音大语言模型(SpeechLLM)的架构无法实现真正流式处理的问题:它们要么等待完整话语结束后才输出翻译,要么以固定时间间隔输出,导致延迟高,不适用于实时应用场景。解决方案的关键在于提出一种新型LLM架构,使模型不仅学习如何生成输出文本的token,还学习自主判断当前接收的音频是否足够支撑输出决策,从而实现自适应流式输出。系统利用输入语音与输出文本的自动对齐进行训练,在多种语言对的实验中,该方案在保持接近非流式基线翻译质量的同时,将延迟降低至仅1-2秒。

链接: https://arxiv.org/abs/2605.14766
作者: Titouan Parcollet,Shucong Zhang,Xianrui Zheng,Rogier C. van Dalen
机构: Samsung(三星), AI Center(人工智能中心) – Cambridge(剑桥), United Kingdom(英国)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 9 pages of main text; 24 pages in total

点击查看摘要

Abstract:Normally, a system that translates speech into text consists of separate modules for speech recognition and text-to-text translation. Combining those tasks into a SpeechLLM promises to exploit paralinguistic information in the speech and to reduce cascaded errors. But existing SpeechLLM systems are slow since they do not work in a real streaming fashion: they wait for a complete utterance of audio before outputting a translation, or output tokens at fixed intervals, which is not suitable for real applications. This work proposes an LLM-based architecture for real streaming speech-to-text translation. The LLM learns not just to emit output tokens, but also to decide whether it has seen enough audio to do so. The system is trained using automatic alignments of the input speech and the output text. In experiments on different language pairs, the system achieves a translation quality close to the non-streaming baseline, but with a latency of only 1-2 seconds.
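摘要中"模型学习判断是否已听到足够音频再输出"的策略,可抽象为如下读/写循环草图;`WAIT` 控制记号与玩具策略均为假设,仅示意接口形态,并非论文的训练或解码实现。

```python
WAIT = "<wait>"  # 假设的控制记号:模型请求继续读入音频

def streaming_translate(chunks, step):
    """流式读/写循环示意:step(已读音频, 已输出token) 返回下一个 token 或 WAIT。"""
    audio, out = [], []
    for chunk in chunks:
        audio.append(chunk)
        while True:
            tok = step(audio, out)
            if tok == WAIT:
                break            # 模型判断音频尚不足以继续输出
            out.append(tok)
            if tok == "<eos>":
                return out
    return out

# 玩具策略:每读入一个音频块就输出一个 token
toks = streaming_translate(
    ["a1", "a2", "a3"],
    lambda audio, out: f"t{len(out)}" if len(out) < len(audio) else WAIT,
)
print(toks)  # → ['t0', 't1', 't2']
```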

[NLP-31] Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music

【速读】: 该论文试图解决现有音乐生成模型主要基于西方音乐训练,因而难以有效生成非西方音乐(如波斯音乐)的问题,具体体现在波斯音乐独特的音调、调式系统(Dastgah)和节奏结构对模型构成的显著挑战。解决方案的关键在于:首次构建了一个包含超过900小时高质量音频样本、涵盖流行、传统和当代等多种子类型的大规模波斯歌曲数据集,并利用该数据集对现有最先进的生成音乐模型MusicGen进行微调(fine-tuning),从而使其能够学习并生成更符合波斯音乐风格惯例的作品。评估结果通过主观与客观指标(包括语义对齐程度)验证了该方法的有效性。

链接: https://arxiv.org/abs/2605.14765
作者: Mohammad Hossein Sameti,Diba Hadi Esfangereh,Sepehr Harfi Moridani,Leili Javidpour,Mahdieh Soleymani Baghshah
机构: Sharif University of Technology (谢里夫理工大学); Independent Researcher (独立研究员)
类目: Sound (cs.SD); Computation and Language (cs.CL)
备注: 9 pages, 2 figures, 3 tables

点击查看摘要

Abstract:Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours high-quality audio samples across diverse sub-genres, including pop, traditional, and contemporary styles. This dataset captures the rich melodic and cultural diversity of Persian music and serves as the foundation for fine-tuning MusicGen, a state-of-the-art generative music model. We adapt MusicGen to this domain and evaluate its performance by utilizing subjective and objective metrics. To assess the semantic alignment between generated music and intended style tags, we report the proportion of relevant tags accurately reflected in the generated outputs. Our results demonstrate that the fine-tuned model produces compositions that more align with Persian stylistic conventions. This work introduces a new resource for generative music research and illustrates the adaptability of music generation models to underrepresented cultural and linguistic contexts.

[NLP-32] Non-linear Interventions on Large Language Models

【速读】: 该论文试图解决现有大语言模型(LLM)干预方法局限于线性干预(Linear Intervention)的问题,即无法处理沿非线性流形(non-linear manifold)编码的特征,且无法干预缺乏直接输出签名(direct output signature)的隐式特征(implicit features)。解决方案之关键有两点:一是提出一种通用的干预公式(general formulation of intervention),自然地扩展至非线性表示的特征(non-linearly represented features);二是配套设计了一个学习过程,使得干预能够应用于没有直接输出签名的隐式特征。通过在拒绝旁路转向(refusal bypass steering)任务上的验证,该方法通过干预控制拒绝行为的非线性特征,实现了比线性基线更精确的模型引导。

链接: https://arxiv.org/abs/2605.14749
作者: Sangwoo Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on implicit features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
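线性干预与摘要所述的广义(非线性)干预可以对照如下草图理解:`encode`/`decode` 代表待学习的非线性映射,当二者退化为线性投影时即还原为经典线性干预。数值示例仅为演示,并非论文实现。

```python
def linear_intervene(h, v, alpha):
    """线性干预:去除特征方向 v 上的分量,再以强度 alpha 重新注入
    (假设 v 已归一化)。"""
    proj = sum(a * b for a, b in zip(h, v))
    return [a - proj * b + alpha * b for a, b in zip(h, v)]

def general_intervene(h, encode, decode, alpha):
    """广义干预草图:把隐状态映到特征坐标 z,改写为 alpha 后映回。
    当 encode(h)=h·v、decode(z)=z*v 时退化为上面的线性干预。"""
    z = encode(h)
    dz, d0 = decode(alpha), decode(z)
    return [a + b - c for a, b, c in zip(h, dz, d0)]

h = [1.0, 2.0]
out = linear_intervene(h, [1.0, 0.0], alpha=0.0)
print(out)  # → [0.0, 2.0]
```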

[NLP-33] Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining ICML2026

【速读】: 该论文试图解决多模态大语言模型驱动的图形用户界面(GUI)代理的泛化能力受限问题,其核心挑战在于缺乏覆盖多样化真实应用场景的大规模训练数据,而现有数据集严重依赖昂贵的手动标注且局限于狭窄领域。解决方案的关键在于提出一个名为 Video2GUI 的全自动框架,该框架通过粗到细(coarse-to-fine)的过滤策略,从无标注的互联网视频中识别高质量 GUI 教程视频,并将其转换为结构化的代理交互轨迹。该框架应用于 5 亿个视频元数据条目,构建了包含 1200 万条交互轨迹、覆盖 1500 多个应用与网站的 WildGUI 数据集。基于该数据集预训练 Qwen2.5-VL 和 Mimo-VL 模型,在多个 GUI 接地(GUI grounding)和动作(action)基准上取得 5–20% 的持续提升,达到或超越当前顶尖水平。

链接: https://arxiv.org/abs/2605.14747
作者: Weimin Xiong,Shuhao Gu,Bowen Ye,Zihao Yue,Lei Li,Feifan Song,Sujian Li,Hao Tian
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at ICML 2026

点击查看摘要

Abstract:Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training Qwen2.5-VL and Mimo-VL on WildGUI yields consistent improvements of 5-20% across multiple GUI grounding and action benchmarks, matching or surpassing state-of-the-art performance. We will release both the WildGUI dataset and the Video2GUI pipeline to support future research of GUI agents.

[NLP-34] Mechanical Enforcement for LLM Governance: Evidence of Governance-Task Decoupling in Financial Decision Systems

【速读】: 该论文试图解决在受监管金融工作流中,大语言模型(LLM)因同时解释和执行自然语言政策而产生的委托-代理问题(principal-agent failure):模型输出可能表面合规但实际不合规,而现有评估仅关注任务准确性,无法量化决策理由(rationale)层面的政策合规性。解决方案的关键在于引入机械强制执行(mechanical enforcement)机制,即四种在模型解释循环之外运行的原语(primitives),通过架构分离(architectural separation)将明确决策从模型控制中移除,从而在任务准确性与治理质量之间实现解耦。实验表明,机械强制执行将无信息延迟率降低73%,决策理由信息含量提升一倍以上,任务准确率从MCC 0.43提升至0.88,且即使在结构压力下任务性能下降时仍能维持治理质量,验证了治理与任务评估是独立维度,准确性不足以作为受监管AI系统中治理的充分代理。

链接: https://arxiv.org/abs/2605.14744
作者: José Manuel de la Chica Rodríguez,Carlos Martí-González
机构: Santander AI Lab(桑坦德AI实验室); Grupo Santander(桑坦德集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Large language models in regulated financial workflows are governed by natural-language policies that the same model interprets, creating a principal–agent failure: outputs can appear compliant without being compliant. Existing evaluation measures task accuracy but not whether governance constrains behaviour at the decision rationale level – where regulated decisions must be auditable. We introduce five governance metrics that quantify policy compliance at the rationale level and apply them in a synthetic banking domain to compare text-only governance against mechanical enforcement: four primitives operating outside the model’s interpretive loop. Under text-only governance, 27% of deferrals carry no decision-relevant information. Mechanical enforcement reduces this rate by 73%, more than doubles deferral information content, and raises task accuracy from MCC~ 0.43 to 0.88 . The improvement is driven by architectural separation: LLM-generated rationales under mechanical enforcement show comparable CDL to text-only governance – the gain comes from removing clear-cut decisions from the model’s control. A causal ablation confirms that each primitive is individually necessary. Our central finding is a governance-task decoupling: under structural stress, text-only governance degrades on both dimensions simultaneously, whereas mechanical enforcement preserves governance quality even as task performance drops. This implies that governance and task evaluation are distinct axes: accuracy is not a sufficient proxy for governance in regulated AI systems.
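"机械强制执行"的架构分离思想可以用如下草图说明:明确可判的案例由模型解释循环之外的规则原语决定,只有其余案例才交给 LLM。限额规则与字段名均为假设示例,并非论文定义的四种原语本身。

```python
def govern(model_decide, hard_rules, case):
    """机械强制执行示意:规则返回 None 表示不适用,依次尝试;
    全部不适用时才回退到模型决策。"""
    for rule in hard_rules:
        verdict = rule(case)
        if verdict is not None:
            return verdict, "rule"   # 明确案例:由规则原语裁决
    return model_decide(case), "model"

# 假设的规则原语:超过硬性限额的交易一律延迟处理
limit_rule = lambda case: "defer" if case["amount"] > 10_000 else None

verdict, source = govern(lambda c: "approve", [limit_rule], {"amount": 50_000})
print(verdict, source)  # → defer rule
```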

[NLP-35] Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model

【速读】: 该论文试图解决大型语言模型(LLM)在ICU脓毒症管理中因缺乏行动条件患者动态(action-conditioned patient dynamics)而无法做出可靠序列治疗决策的问题。解决方案的关键在于引入一个基于学习的临床世界模型(Clinical World Model),该模型能够模拟患者在候选液体-血管加压药干预下的反应,并配合一个“提出-模拟-优化”(propose–simulate–refine)工作流程,在最终处方前进行模拟验证。此外,通过三阶段课程训练(患者动态监督微调、行为克隆和基于世界模型的智能体强化学习)来专门训练SepsisAgent,使其在与临床世界模型重复交互中掌握患者的演变规律,从而在无直接模拟器支持时仍能保持决策性能。

链接: https://arxiv.org/abs/2605.14723
作者: Minghao Wu,Yuting Yan,Zhenyang Cai,Ke Ji,Chuangsen Fang,Ziying Sheng,Xidong Wang,Rongsheng Wang,Hejia Zhang,Shuang Li,Benyou Wang,Hongyuan Zha
机构: The Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Beijing Hospital (北京医院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sepsis management in the ICU requires sequential treatment decisions under rapidly evolving patient physiology. Although large language models (LLMs) encode broad clinical knowledge and can reason over guidelines, they are not inherently grounded in action-conditioned patient dynamics. We introduce SepsisAgent, a world model-augmented LLM agent for sepsis treatment recommendation. SepsisAgent uses a learned Clinical World Model to simulate patient responses under candidate fluid–vasopressor interventions, and follows a propose–simulate–refine workflow before committing to a prescription. We first show that world-model access alone yields inconsistent LLM decision performance, motivating agent-specific training. We then train SepsisAgent through a three-stage curriculum: patient-dynamics supervised fine-tuning, propose–simulate–refine behavior cloning, and world-model-based agentic reinforcement learning. On MIMIC-IV sepsis trajectories, SepsisAgent outperforms all traditional RL and LLM-based baselines in off-policy value while achieving the best safety profile under guideline adherence and unsafe-action metrics. Further analysis shows that repeated interaction with the Clinical World Model enables the agent to learn regularities in patient evolution, which remain useful even when simulator access is removed.
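propose–simulate–refine 工作流的控制结构可以用如下草图示意;玩具的线性"病人响应"与剂量序列均为假设,仅演示"提出候选干预 → 经世界模型模拟 → 依评分保留最优并反馈"的循环,并非论文的临床世界模型。

```python
def propose_simulate_refine(state, propose, world_model, score, rounds=3):
    """propose-simulate-refine 循环示意:模拟结果 outcome 作为
    反馈送入下一轮 propose。"""
    best, best_score, feedback = None, float("-inf"), None
    for _ in range(rounds):
        candidate = propose(state, feedback)
        outcome = world_model(state, candidate)
        s = score(outcome)
        if s > best_score:
            best, best_score = candidate, s
        feedback = outcome
    return best

# 玩具设定:对剂量线性响应,目标响应为 5
doses = iter([2, 5, 9])
best = propose_simulate_refine(
    state=0,
    propose=lambda state, feedback: next(doses),
    world_model=lambda state, dose: state + dose,
    score=lambda outcome: -abs(outcome - 5),
)
print(best)  # → 5
```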

[NLP-36] IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation

【速读】: 该论文旨在解决现有帧条件化视觉-语言-动作(Vision-Language-Action, VLA)策略在机器人模仿学习中因忽略历史上下文而导致的块间冲突与执行不稳定性问题。具体而言,机器人演示数据具有多模态特性,相似的观测-语言输入可能对应不同的动作块(action chunks),而现有策略仅依赖当前观测与指令独立推断每个块,在部分可观测条件下会重新采样不同的短视意图,引发相邻重规划步骤间的意图冲突。解决方案的关键是提出 IntentVLA,一种历史条件化的 VLA 框架,它通过将近期视觉观测编码为紧凑的短视意图表示(short-horizon intent representation),并将该表示作为条件来生成后续动作块,从而在相邻步骤间保持意图一致性,提升滚动执行稳定性。

链接: https://arxiv.org/abs/2605.14712
作者: Shijie Lian,Bin Yu,Xiaopeng Lin,Zhaolong Shen,Laurence Tianruo Yang,Yurun Jin,Haishan Liu,Changti Wu,Hang Yuan,Cong Huang,Kai Chen
机构: HUST(华中科技大学); ZGCA(中关村学院); ZGCI(中关村创新研究院); HIT(哈尔滨工业大学); HKUST(GZ)(香港科技大学(广州)); BUAA(北京航空航天大学); ZZU(郑州大学); ECNU(华东师范大学); USTC(中国科学技术大学); DeepCybo(深度赛博)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Code can be found in this https URL

点击查看摘要

Abstract:Robot imitation data are often multimodal: similar visual-language observations may be followed by different action chunks because human demonstrators act with different short-horizon intents, task phases, or recent context. Existing frame-conditioned VLA policies infer each chunk from the current observation and instruction alone, so under partial observability they may resample different intents across adjacent replanning steps, leading to inter-chunk conflict and unstable execution. We introduce IntentVLA, a history-conditioned VLA framework that encodes recent visual observations into a compact short-horizon intent representation and uses it to condition chunk generation. We further introduce AliasBench, a 12-task ambiguity-aware benchmark on RoboTwin2 with matched training data and evaluation environments that isolate short-horizon observation aliasing. Across AliasBench, SimplerEnv, LIBERO, and RoboCasa, IntentVLA improves rollout stability and outperforms strong VLA baselines.

[NLP-37] AI-assisted cultural heritage dissemination: Comparing NMT and glossary-augmented LLM translation in rock art documents

【速读】: 该论文试图解决文化遗产机构在术语密集领域(如岩画艺术)进行多语言翻译时,因预算和人员限制导致翻译质量下降,尤其是专有术语不准确、不一致会误导非专业人士并降低内容复用性的问题。解决方案的关键在于采用简单、操作可行且无需复杂模型端修改的干预措施,即对比三种英文机器翻译(MT)设置后,发现基于检索增强生成(RAG)的词汇表增强提示(Gemini-RAG)效果最优:它通过从维护的最小化术语资源中检索术语对来增强大语言模型(LLM)的提示,从而在不牺牲整体翻译质量的前提下,显著提升术语精确匹配准确率(81.4%),优于单纯基础提示的LLM(69.1%)和强神经机器翻译(NMT)基线DeepL(64.4%),表明该低开销方法能有效改善文化遗产翻译中的术语控制。

链接: https://arxiv.org/abs/2605.14679
作者: Vicent Briva-Iglesias,María Ferre-Fernández
机构: SALIS (应用语言与跨文化研究学院); CTTS (翻译与文本研究中心); ADAPT Centre (ADAPT中心); Dublin City University (都柏林城市大学); Universidad de Almería (阿尔梅里亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Cultural heritage institutions increasingly disseminate research and interpretive materials globally, but multilingual dissemination is constrained by limited budgets and staffing. In terminology-dense domains such as rock art, translation quality depends on accurate, consistent specialised terms, and small lexical errors can mislead non-specialists and reduce reuse. We compare three English MT setups for a Spanish academic rock art text, focusing on simple, operationally feasible interventions rather than complex model-side modifications: (1) DeepL as a strong NMT baseline, (2) Gemini-Simple (LLM with a basic prompt), and (3) Gemini-RAG (the same LLM with glossary-augmented prompting via term-pair retrieval). Using PEARMUT, we conduct a human evaluation via (i) multi-way Direct Assessment (0–100) and (ii) targeted terminology auditing with a restricted MQM taxonomy. Gemini-RAG yields the highest exact-match terminology accuracy (81.4%), versus Gemini-Simple (69.1%) and DeepL (64.4%), while preserving overall quality (mean DA 85.3 Gemini-RAG vs. 85.2 Gemini-Simple), outperforming DeepL (80.3). These results show that glossary-augmented prompting is a low-overhead way to improve terminology control in cultural-heritage translation if institutions maintain minimal terminology resources and lightweight evaluation procedures.
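词汇表增强提示(glossary-augmented prompting)的检索与拼接逻辑可以用如下草图示意;实际系统中术语检索可以是模糊或形态感知的,这里仅用最简单的子串匹配,模板与术语对均为假设示例。

```python
def glossary_augmented_prompt(source_text, glossary, template):
    """词汇表增强提示示意:检出源文中出现的术语对,拼入翻译提示。"""
    hits = [(src, tgt) for src, tgt in glossary.items() if src in source_text]
    block = "\n".join(f"{src} -> {tgt}" for src, tgt in hits)
    return template.format(glossary=block, text=source_text)

prompt = glossary_augmented_prompt(
    "Los grabados rupestres del abrigo presentan...",
    {"grabados rupestres": "rock engravings", "dolmen": "dolmen"},
    "Use these term pairs:\n{glossary}\n\nTranslate into English:\n{text}",
)
print("grabados rupestres -> rock engravings" in prompt)  # → True
```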

[NLP-38] Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution

【速读】: 该论文旨在解决大视觉语言模型(LVLMs)在语言先验主导弱或模糊视觉证据时产生的幻觉问题。现有对比解码方法通过比较原始图像与外部扰动视觉输入的预测来缓解幻觉,但此类参考可能引入离流形伪影并需要额外的前向计算。论文提出SIRA,一种无需训练的内部对比解码框架,其关键解决方案在于利用多模态Transformer的分阶段信息流,在同一LVLM内部构建反事实参考。具体而言,SIRA首先通过共享前缀阶段让图像和文本令牌交互,形成保留了提示解释、解码历史、位置结构和早期视觉对齐的多模态状态;随后在后期Transformer层分出一个反事实分支,在该分支中屏蔽对图像令牌位置的注意力,使分支保留共享多模态上下文但无法继续获取细粒度视觉证据,从而生成语言先验主导的内部参考用于令牌级对比解码。解码时,SIRA抑制那些即使缺乏晚期视觉访问仍然置信的令牌,而优先选择依赖于完整视觉通路才有优势的预测,最终在不依赖训练、外部验证器或扰动输入的情况下有效减少幻觉,同时保持描述覆盖范围并降低计算开销。

链接: https://arxiv.org/abs/2605.14621
作者: Tian Qin,Junzhe Chen,Yuqing Shi,Tianshu Zhang,Qiang Ju,Lijie Wen
机构: Tsinghua University(清华大学); The University of Sydney(悉尼大学); Stanford University(斯坦福大学); Baichuan AI(百川智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references can introduce off-manifold artifacts and require costly extra forward passes. We propose SIRA, a training-free internal contrastive decoding framework that constructs a counterfactual reference inside the same LVLM by exploiting the staged information flow of multimodal transformers. Instead of removing visual information from the input, SIRA first lets image and text tokens interact through a shared prefix, forming an aligned multimodal state that preserves prompt interpretation, decoding history, positional structure, and early visual grounding. It then forks a counterfactual branch in later transformer layers, where attention to image-token positions is masked. This branch retains the shared multimodal context but lacks continued access to fine-grained visual evidence, yielding a language-prior-dominated internal reference for token-level contrast. During decoding, SIRA suppresses tokens that remain strong without late visual access and favors predictions whose advantage depends on the full visual pathway. Experiments on POPE, CHAIR, and AMBER with Qwen2.5-VL and LLaVA-v1.5 show that SIRA consistently reduces hallucinations while preserving descriptive coverage and incurring lower overhead than two-pass contrastive decoding. SIRA requires no training, external verifier, or perturbed input, and applies to open-weight LVLMs with white-box inference access.
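SIRA 的 token 级对比解码核心可以抽象为如下打分草图:抑制在"语言先验主导"的反事实分支中依然高置信的 token,偏好依赖完整视觉通路才占优的预测。真实实现作用于同一 LVLM 两条前向分支的 logits,这里用字典代替,数值为假设。

```python
def contrastive_scores(full_logits, ref_logits, alpha=1.0):
    """token 级对比打分示意:full - alpha * ref。"""
    return {t: full_logits[t] - alpha * ref_logits.get(t, 0.0) for t in full_logits}

full = {"cat": 5.0, "dog": 4.0}   # 完整视觉通路分支
ref = {"cat": 1.0, "dog": 4.5}    # 语言先验主导的反事实分支
scores = contrastive_scores(full, ref)
best = max(scores, key=scores.get)
print(best)  # → cat
```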

[NLP-39] SciPaths: Forecasting Pathways to Scientific Discovery

【速读】: 该论文试图解决现有 AI4Science 基准(benchmark)侧重于引文预测(citation prediction)、文献检索(literature retrieval)或想法生成(idea generation),而忽略了使科学进步成为可能的贡献之间依赖关系(dependencies)的问题。为此,论文引入了发现路径预测(discovery pathway forecasting)任务:给定一个目标科学贡献(target scientific contribution)和特定时间点可用的先前文献,要求(1)识别实现该贡献所需的使能贡献(enabling contributions);(2)将每个使能贡献在存在先前工作时与先前工作建立关联(prior-work grounding)。解决方案的关键是构建 SciPaths 基准,包含 262 条专家注释的金标准路径(gold pathways)和 2,444 条从机器学习和自然语言处理论文中构建的银标准路径(silver pathways),每条路径记录了使能贡献、角色(roles)、理由(rationales)以及先前工作关联或未映射决策(unmapped decisions)。评估表明,最佳模型在严格语义匹配(strict semantic matching)下 F1 仅为 0.189,核心方法依赖性(core methodological dependencies)最难恢复;当提供金标准使能贡献时,先前工作关联显著提高,证明分解质量(decomposition quality)是端到端路径恢复(end-to-end pathway recovery)的主要瓶颈。因此,SciPaths 将评估转向科学预测中缺失的能力:从目标贡献向后推理(reasoning backward)到使其可行的使能科学构建块(enabling scientific building blocks)和先前工作依赖关系(prior-work dependencies)。

链接: https://arxiv.org/abs/2605.14600
作者: Eric Chamoun,Yizhou Chi,Yulong Chen,Rui Cao,Zifeng Ding,Michalis Korakakis,Andreas Vlachos
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scientific progress depends on sequences of enabling contributions, yet existing AI4Science benchmarks largely focus on citation prediction, literature retrieval, or idea generation rather than the dependencies that make progress possible. In this paper, we introduce discovery pathway forecasting: given a target scientific contribution and the prior literature available at a specified time, the task is to (1) identify the enabling contributions required to realize it and (2) ground each in prior work when such prior work exists. We present SciPaths, a benchmark of 262 expert-annotated gold pathways and 2,444 silver pathways constructed from machine learning and natural language processing papers, where each pathway records enabling contributions, roles, rationales, and prior-work groundings or unmapped decisions. Evaluating frontier and open-weight language models, we find that the best model reaches only 0.189 F1 under strict semantic matching, with core methodological dependencies hardest to recover. Prior-work grounding improves substantially when gold enabling contributions are provided, showing that decomposition quality is a major bottleneck for end-to-end pathway recovery. SciPaths therefore shifts evaluation toward a missing capability in scientific forecasting: reasoning backward from a target contribution to the enabling scientific building blocks and prior-work dependencies that make it feasible.

[NLP-40] EndPrompt: Efficient Long-Context Extension via Terminal Anchoring

【速读】: 该论文试图解决大规模语言模型(Large Language Models)在扩展上下文窗口时,因需要在目标长度序列上训练而导致的二次内存和计算成本高昂、难以复现的问题。解决方案的核心是EndPrompt方法,其关键思想在于:通过仅使用短训练序列即可实现有效的上下文扩展,具体做法是将原始短上下文保留为完整的第一段,并附加一个简短的终端提示作为第二段,为其分配接近目标上下文长度的位置索引,从而在一个短物理序列中同时引入局部和长距离相对位置距离,同时保持训练文本的语义连续性;该方法基于旋转位置嵌入(Rotary Position Embedding)和伯恩斯坦不等式(Bernstein inequality)的理论分析,证明位置插值在注意力函数上施加了严格的平滑性约束,而共享的Transformer参数进一步抑制了未观察中间距离的不稳定外推,从而实现了从稀疏位置监督诱导长上下文泛化。

链接: https://arxiv.org/abs/2605.14589
作者: Han Tian,Luxuan Chen,Xinran Chen,Rui Kong,Fang Wang,Jiamin Chen,Jinman Zhao,Yuchen Li,Jiashu Zhao,Shuaiqiang Wang,Haoyi Xiong,Dawei Yin
机构: Nankai University (南开大学); Baidu Inc. (百度公司); Shanghai Jiao Tong University (上海交通大学); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective context extension using only short training sequences. The core insight is that exposing a model to long-range relative positional distances does not require constructing full-length inputs: we preserve the original short context as an intact first segment and append a brief terminal prompt as a second segment, assigning it positional indices near the target context length. This two-segment construction introduces both local and long-range relative distances within a short physical sequence while maintaining the semantic continuity of the training text–a property absent in chunk-based simulation approaches that split contiguous context. We provide a theoretical analysis grounded in Rotary Position Embedding and the Bernstein inequality, showing that position interpolation induces a rigorous smoothness constraint over the attention function, with shared Transformer parameters further suppressing unstable extrapolation to unobserved intermediate distances. Applied to LLaMA-family models extending the context window from 8K to 64K, EndPrompt achieves an average RULER score of 76.03 and the highest average on LongBench, surpassing LCEG (72.24), LongLoRA (72.95), and full-length fine-tuning (69.23) while requiring substantially less computation. These results demonstrate that long-context generalization can be induced from sparse positional supervision, challenging the prevailing assumption that dense long-sequence training is necessary for reliable context-window extension. The code is available at this https URL.
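EndPrompt 的两段式位置索引构造可以用如下草图直观说明:原短上下文保留位置 0..ctx_len-1,终端提示被锚定在接近目标长度的位置,从而在一个很短的物理序列内同时暴露局部与长距离相对位置。具体索引分配以论文为准,这里的参数仅为演示。

```python
def endprompt_positions(ctx_len, prompt_len, target_len):
    """EndPrompt 两段式位置索引示意。"""
    first = list(range(ctx_len))                            # 原短上下文
    second = list(range(target_len - prompt_len, target_len))  # 终端提示
    return first + second

pos = endprompt_positions(ctx_len=8, prompt_len=3, target_len=64)
print(pos)  # → [0, 1, 2, 3, 4, 5, 6, 7, 61, 62, 63]
```

物理序列长度仅 11,但模型已观察到接近 64 的相对位置距离。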

[NLP-41] Uncertainty Quantification for Large Language Diffusion Models

【速读】: 该论文旨在解决大语言扩散模型(LLDMs, Large Language Diffusion Models)在不确定性量化(UQ, Uncertainty Quantification)上的方法缺失问题。现有UQ方法或是基于自回归分解假设,或是依赖高成本重复采样,与LLDMs的并行高效推理范式根本性错位,导致其无法直接使用。解决方案的关键在于设计一组轻量级、零样本的不确定性信号,这些信号直接从LLDMs迭代去噪过程中提取,具体包括:利用中间生成结果(intermediate generations)、token重掩码动态(token remasking dynamics)以及去噪复杂度(denoising complexity),并通过融合掩码扩散似然(masked diffusion likelihood)与基于轨迹的语义不相似性(trajectory-based semantic dissimilarity)来适配现有最优UQ方法。论文还从理论上证明了期望轨迹不相似性构成掩码扩散训练目标的下界,从而为将其作为不确定性分数提供理论动机。该方法在三个任务、八个数据集和两个模型上的实验表明,它能在接近最强基于采样的基线性能的同时,降低高达100倍的计算开销,从而实现了LLDMs在快速推理与可靠幻觉检测上的兼顾。

链接: https://arxiv.org/abs/2605.14570
作者: Artem Vazhentsev,Vladislav Smirnov,David Li,Maxim Panov,Timothy Baldwin,Artem Shelmanov
机构: MBZUAI(穆罕默德·本·扎耶德人工智能大学); The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. However, existing UQ methods are fundamentally misaligned with this new paradigm: they assume autoregressive factorization or use expensive repeated sampling, negating the efficiency of LLDMs. In this work, we present the first systematic study of UQ for LLDMs and propose lightweight, zero-shot uncertainty signals derived from the iterative denoising process, leveraging intermediate generations, token remasking dynamics, and denoising complexity. We further adapt a state-of-the-art UQ method to LLDMs by combining masked diffusion likelihoods with trajectory-based semantic dissimilarity. We prove that expected trajectory dissimilarity lower bounds the masked diffusion training objective, which motivates its usage as an uncertainty score. Comprehensive experiments across three tasks, eight datasets, and two models show that our method achieves a great cost-performance trade-off: it approaches the strongest sampling-based baselines while incurring up to 100x lower computational overhead. Our work demonstrates that LLDMs can deliver both fast inference and reliable hallucination detection simultaneously.
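摘要中"基于轨迹的语义不相似性"可以用如下假设性草图近似(论文使用语义嵌入度量相似度,此处以词集 Jaccard 重叠代替,仅作示意):对去噪过程的若干中间生成两两计算不相似度并取均值,作为零样本不确定性分数。

```python
def jaccard_dissimilarity(a, b):
    """用 1 - Jaccard 重叠近似语义不相似度(嵌入相似度的简化替代,仅示意)。"""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)

def trajectory_uncertainty(intermediate_generations):
    """对去噪轨迹的中间生成两两计算不相似度并取均值,作为不确定性分数。"""
    gens = list(intermediate_generations)
    pairs = [(i, j) for i in range(len(gens)) for j in range(i + 1, len(gens))]
    if not pairs:
        return 0.0
    return sum(jaccard_dissimilarity(gens[i], gens[j]) for i, j in pairs) / len(pairs)
```

直觉上:若各去噪步的中间生成彼此一致,分数接近 0;若轨迹摇摆不定,分数升高,提示可能存在幻觉。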

[NLP-42] Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines

【速读】: 该论文旨在解决行为驱动开发(Behaviour-Driven Development, BDD)测试套件中因重复步骤子序列累积而导致的维护负担问题,即如何自动识别哪些重复子序列值得提取(extraction-worthy),并将每个子序列预映射到已有的三种重构模式(within-file Background、within-repo reusable-scenario invocation、cross-organisational shared higher-level step),同时量化这些机会在公共BDD生态中的普遍性。解决方案的关键在于:构建一个基于语义等价的子序列发现管道,利用Sentence-BERT(SBERT)、UMAP和HDBSCAN对Gherkin语料中所有连续步骤窗口进行 paraphrase-robust 聚类,从而将海量原始切片归并为重复模式;随后通过人工标注的200个分层样本训练一个XGBoost分类器,以自动判断每个模式的提取适宜性及对应的重构机制,该分类器在5折交叉验证中达到F1=0.891,显著优于规则基线和大型语言模型(LLM)评判员,从而实现了对BDD重构机会的自动化、可扩展的普查与分类。

链接: https://arxiv.org/abs/2605.14568
作者: Ali Hassaan Mughal,Noor Fatima,Muhammad Bilal
机构: 未知
类目: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 30 pages, 12 figures and tables, 58 references. Under review at Software Quality Journal (Springer). Reproduction package at this https URL (Apache-2.0). Upstream cukereuse corpus at this https URL

点击查看摘要

Abstract:Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences (“slices”) by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. Sentence-BERT (SBERT) / Uniform Manifold Approximation and Projection (UMAP) / Hierarchical Density-Based Clustering (HDBSCAN) recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An eXtreme Gradient Boosting (XGBoost) extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss’ kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p 1e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, or cross-organisational shared-step candidate. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring opportunities; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
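摘要中的切片挖掘步骤(枚举每个场景中所有长度 L∈[2,18] 的连续步骤窗口并按键计数)可用如下假设性草图说明(论文用 SBERT/UMAP/HDBSCAN 得到对同义改写鲁棒的聚类键,这里以小写归一化代替,仅作示意):

```python
from collections import Counter

def mine_slices(scenarios, min_len=2, max_len=18):
    """枚举每个场景的所有连续步骤子序列(切片)并统计出现次数。
    小写归一化只是对论文中同义鲁棒聚类键的粗略近似。"""
    counts = Counter()
    for steps in scenarios:
        n = len(steps)
        for L in range(min_len, min(max_len, n) + 1):   # 窗口长度 L ∈ [2, 18]
            for i in range(n - L + 1):
                key = tuple(s.lower().strip() for s in steps[i:i + L])
                counts[key] += 1
    # 只保留重复出现的模式,作为潜在重构候选
    return {k: c for k, c in counts.items() if c >= 2}
```

重复模式随后才交给分类器判断是否"值得提取"以及适用哪种重构机制。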

[NLP-43] Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

【速读】: 该论文旨在解决现有仓库级自动化代码文档生成方法中因独立处理组件而导致的冗余检索、描述冲突及输出缺乏层次结构的问题。其解决方案的关键在于提出了一个长时程智能体框架 MemDocAgent,该框架通过两个核心组件实现:一是“依赖感知遍历引导 (Dependency-Aware Traversal Guiding)”,它预先确定一个遵循依赖关系和粒度层次结构的遍历顺序,确保上下文一致性;二是“记忆引导的智能体交互 (Memory-Guided Agentic Interaction)”,其中智能体与共享记忆模块 RepoMemory 交互,通过读写和验证操作累积先前的工作痕迹,从而在单一集成上下文中生成覆盖整个仓库的结构化文档。

链接: https://arxiv.org/abs/2605.14563
作者: Suyoung Bae,Jaehoon Lee,Changkyu Choi,YunSeok Choi,Jee-Hyong Lee
机构: Sungkyunkwan University(成均馆大学); University of Oslo(奥斯陆大学)
类目: Software Engineering (cs.SE); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and conflicting descriptions across documents while producing outputs that lack hierarchical structure. Therefore, we propose MemDocAgent, a long-horizon agentic framework that generates documentation within a single, integrated context spanning the entire repository. It combines two components: (i) Dependency-Aware Traversal Guiding that predetermines a traversal order respecting dependency and granularity hierarchies; (ii) Memory-Guided Agentic Interaction, in which the agent interacts with RepoMemory, a shared memory accumulating prior work traces through read, write, and verify operations. Through an in-depth multi-criteria evaluation, MemDocAgent achieves the best performance over both open and closed-source baselines and demonstrates practical applicability in real software development workflows.

[NLP-44] Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

【速读】: 该论文试图解决的问题是:在智能体强化学习(agentic reinforcement learning)训练大型语言模型时,常用的策略梯度方法(如PPO和GRPO)对轨迹中每个token进行均匀的信用分配(uniform credit assignment),导致训练信号在token级别被错误分配——推理token(reasoning tokens)获得了与其实际贡献不成比例的信号,而关键的、仅占轨迹一小部分的动作token(action tokens)却得不到足够的强化信号。这一现象被定义为“动作瓶颈”(Action Bottleneck)。解决方案的关键在于提出一种极其简单的token重加权方法ActFocus:一方面降低推理token的梯度权重,另一方面引入基于能量的再分配机制(energy-based redistribution mechanism),进一步增加不确定性较高(即与奖励方差相关性更强)的动作token的权重。该方法不引入额外运行时或内存开销,通过聚焦于动作token的训练信号,显著提升了策略梯度方法的性能。

链接: https://arxiv.org/abs/2605.14558
作者: Langzhou He,Junyou Zhu,Yue Zhou,Zhengyao Gu,Junhua Liu,Wei-Chieh Huang,Henry Peng Zou,David Wipf,Philip S. Yu,Qitian Wu
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Potsdam Institute for Climate Impact Research (波茨坦气候影响研究所); Technical University of Berlin (柏林工业大学); University of Southern California (南加州大学); University of Hong Kong (香港大学); Broad Institute of MIT and Harvard (麻省理工学院和哈佛大学博德研究所)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment. In this paper, we critically demonstrate that such uniform credit assignment largely misallocates token-level training signals. From an energy-based modeling perspective, we show that token-level training signals, quantified by their correlations with reward variance of different rollouts sampled from a given prompt, concentrate sharply on action tokens rather than reasoning tokens, even though action tokens account for only a small fraction of the trajectory. We refer to this phenomenon as the Action Bottleneck. Motivated by this observation, we propose an embarrassingly simple token reweighting approach, ActFocus, that downweights gradients on reasoning tokens, along with an additional energy-based redistribution mechanism that further increases the weights on action tokens with higher uncertainty. Across four environments and different model sizes, ActFocus consistently outperforms PPO and GRPO, yielding final-step gains of up to 65.2 and 63.7 percentage points, respectively, without any additional runtime or memory cost.
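ActFocus 的核心重加权操作可以用如下极简草图表达(假设性实现,非论文官方代码;此处以 token 级熵近似"能量/不确定性"):先折减推理 token 的梯度权重,再按能量比例把额外权重分配给高不确定性的动作 token。

```python
import math

def actfocus_weights(is_action, entropy, reason_scale=0.1, temp=1.0):
    """为每个 token 计算策略梯度权重(示意)。
    is_action: 0/1 列表,标记动作 token;entropy: token 级不确定性(假设量)。"""
    base = [1.0 if a else reason_scale for a in is_action]            # 折减推理 token
    boost = [math.exp(e / temp) if a else 0.0
             for a, e in zip(is_action, entropy)]                     # 能量项仅作用于动作 token
    total = sum(boost)
    n_action = sum(1 for a in is_action if a)
    if total > 0:
        # 把 n_action 份额外权重按能量比例再分配给动作 token
        base = [w + b / total * n_action for w, b in zip(base, boost)]
    return base
```

这样训练信号集中到少量动作 token 上,且不引入额外的运行时或显存开销。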

[NLP-45] Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

【速读】: 该论文试图解决基于可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR)在大语言模型推理训练中面临的稀疏二元奖励(sparse binary rewards)与弱信用分配(weak credit assignment)问题,这类问题导致优化信号模糊,并且未能充分利用失败轨迹(failed trajectories)中蕴含的有用信息。解决方案之关键在于提出了修正导向的策略优化(Correction-Oriented Policy Optimization, CIPO),这是一种简单且有效的扩展方法,能够在不依赖任何外部信号的前提下,将在线策略产生的失败轨迹转化为修正导向的监督信号(correction-oriented supervision),并通过联合优化来自模型自身失败尝试的修正样本与标准RLVR目标,同时提升学习效果和模型修正自身错误的能力,从而显著改善模型的内在推理能力。

链接: https://arxiv.org/abs/2605.14539
作者: Mengjie Ren,Jie Lou,Boxi Cao,Xueru Wen,Hongyu Lin,Xianpei Han,Le Sun,Xing Yu,Yaojie Lu
机构: Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所中文信息处理实验室); University of Chinese Academy of Sciences (中国科学院大学); Xiaohongshu Inc (小红书公司)
类目: Computation and Language (cs.CL)
备注: Work on progress

点击查看摘要

Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and underutilization of the useful information embedded in failed trajectories. To address this challenge, we propose Correction-Oriented Policy Optimization (CIPO), a simple and effective extension to RLVR that converts on-policy failed trajectories into correction-oriented supervision, without relying on any external signals. By jointly optimizing correction samples derived from the model’s own failed attempts together with the standard RLVR objective, CIPO improves learning effectiveness while explicitly enhancing the model’s ability to correct its own errors. Extensive experiments across 11 benchmarks spanning mathematical reasoning and code generation demonstrate that CIPO consistently and significantly outperforms strong baselines in both reasoning and correction performance. Moreover, CIPO yields stronger pass@K gains, indicating that it improves the model’s intrinsic reasoning capacity rather than merely redistributing probability mass over existing correct answers.
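CIPO 的核心数据构造(把在线策略的失败轨迹转成"修正样本",不依赖任何外部信号)可以用如下假设性草图说明,其中提示模板与函数签名均为本文示例,并非论文实现:

```python
def build_correction_samples(rollouts, verifier):
    """rollouts: [(prompt, response)];verifier(prompt, response)->bool 即可验证奖励。
    失败回答被包装为"找错并修正"的新提示,交由策略自身再次求解。"""
    rl_batch, correction_batch = [], []
    for prompt, response in rollouts:
        if verifier(prompt, response):
            rl_batch.append((prompt, response, 1.0))       # 标准 RLVR 正样本
        else:
            rl_batch.append((prompt, response, 0.0))       # 失败轨迹仍参与 RLVR 目标
            corr_prompt = (f"{prompt}\n以下是一次错误的尝试:\n{response}\n"
                           "请找出错误并给出修正后的解答。")
            correction_batch.append(corr_prompt)           # 修正导向的监督信号
    return rl_batch, correction_batch
```

两个批次随后被联合优化:前者对应标准 RLVR 目标,后者显式训练模型修正自身错误的能力。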

[NLP-46] Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

【速读】: 该论文试图解决自回归模型与扩散模型在语言生成中存在的根本性局限,包括效率-保真度悖论 (Efficiency-Fidelity Paradox)、不可逆误差传播 (Irreversibility Error Propagation)、优化可处理性 (Optimization Tractability) 及保真度 (Fidelity) 问题,这些局限源于轨迹奇异性 (trajectory singularity)、伴随状态消失 (adjoint state vanishing) 与梯度缺失 (gradient absence) 的组合效应。解决方案的关键在于:首先将语言生成重新表述为一个随机最优控制 (stochastic optimal control) 问题,通过近似求解Hamilton-Jacobi-Bellman (HJB) 方程来获得一个作为闭环控制器 (closed-loop controller) 的最优策略;为规避直接求解HJB偏微分方程的不可处理性,采用流匹配 (Flow Matching) 作为矫正潜在控制空间 (rectified latent control space) 中的最优轨迹求解器;最后利用带有全局积分算子 (Global Integral Operator) 的Manta-LM来近似全局向量场,从而在理论上同时实现高保真文本生成与高效低成本的并行采样。

链接: https://arxiv.org/abs/2605.14531
作者: ZiYi Dong,Yuliang Huang,Weijian Deng,Xiangyang Ji,Liang Lin,Pengxu Wei
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

[NLP-47] Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation

【速读】: 该论文试图解决当前大型语言模型(LLM)评估中,整体(holistic)评分无法区分模型是否复制了用户请求的结构形式(structural form)与是否保留了用户的具体意图(specific intent)这一核心问题。解决方案的关键在于提出一个维度级的意图保真度(intent fidelity)评估框架,该框架通过结构化提示消融(structured prompt ablation)实验,针对每个语义维度(semantic dimension)分别测量结构恢复(structural recovery)和意图保真度,从而揭示出系统性的结构-保真度分裂(structural-fidelity split)。此外,该框架进一步采用公共-私有分解(public-private decomposition)来表征模型在不同上下文下的补偿能力与失败模式,并通过代理标注(proxy annotation)区分先验可推断性(prior inferability)与默认可恢复性(default recoverability),最后通过权重扰动实验(weight-perturbation experiment)验证了不同程度错配的影响,从而证明维度级意图保真度评估是整体评估的必要补充。

链接: https://arxiv.org/abs/2605.14517
作者: Gang Peng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. 30 tasks, 3 languages, 6 LLMs, 2,880 outputs; includes human evaluation and structured prompt ablation

点击查看摘要

Abstract:Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user’s request from whether it preserved the user’s specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structured prompt ablation study across 2,880 outputs spanning three languages, three task domains, and six LLMs, that separately measures structural recovery and intent fidelity for each semantic dimension. This framework reveals a systematic structural-fidelity split: among Chinese-language outputs with complete paired scores, 25.7% received perfect holistic alignment scores (GA=5) while exhibiting measurable dimensional intent deficits; among English-language outputs, this proportion rose to 58.6%. Human evaluation confirmed that these split-zone outputs represent genuine quality deficits and that dimensional fidelity scores track human judgements more reliably than holistic scores do. A public-private decomposition of 2,520 ablation cells characterises when models successfully compensate for missing intent and when they fail, while proxy annotation distinguishes prior inferability from default recoverability. A weight-perturbation experiment shows that moderate misalignment is typically absorbed, whereas severe dimensional inversion is consistently harmful. These findings demonstrate that dimension-level intent fidelity evaluation is a necessary complement to holistic assessment when evaluating LLM outputs for user-specific tasks.

[NLP-48] GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

【速读】: 该论文试图解决现有大语言模型(LLM)代理的记忆系统在群体交互场景下的根本性缺失问题。当前所有主流记忆系统和评估基准均局限于二元单用户设定(dyadic, single-user setup),而现实部署中代理常处于包含多个用户和频道的群组环境,导致群体记忆的三个关键特性——超越简单串联对话的群体动态(group dynamics)、基于说话者的信念追踪(speaker-grounded belief tracking)以及基于心智理论(Theory-of-Mind)的听众适应语言(audience-adapted language)——完全未被测量和优化。解决方案之关键在于提出了GroupMemBench基准,其核心包含两个创新:一是图基合成管道(graph-grounded synthesis pipeline),通过可控回复结构生成多参与者对话,并为每条消息绑定个性化用户画像和目标听众;二是一个对抗性查询管道(adversarial query pipeline),针对六类推理能力(多跳推理、知识更新、术语歧义、用户隐含推理、时间推理、拒绝回答)为每个提问者生成具有挑战性的现实查询,从而系统性地暴露现有记忆系统在群体记忆特征上的全面崩溃(最强系统平均准确率仅46.0%,其中知识更新类仅27.1%),并揭示出当前记忆提取过程会抹除群体记忆所依赖的结构和词汇特征。

链接: https://arxiv.org/abs/2605.14498
作者: Jingbo Yang,Kwei-Herng Lai,Xiaowen Wang,Shiyu Chang,Yaar Harari,Evgeniy Gabrilovich
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where the per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.

[NLP-49] Cross-Linguistic Transcription and Phonological Representation in the Huìtóngguǎnxì Huáyíyìyǔ

【速读】: 该论文旨在解决一个核心问题:破解《会同馆华夷译语》(HHY)这一明代多语种词汇系列的内在转录原则,将其从一个被视为“孤立语言材料集合”的文本提升为一个“具有内部结构的连贯转录系统”,从而揭示其如何通过汉字系统性地表示非汉语口语形式。解决方案的关键在于两点:第一,将数字化后的HHY数据与同时期汉语音韵范畴进行对齐,并整合以往各语言部分的重建成果,构建一个统一的跨语言比较数据库;第二,通过区分“主转录”(Main Transcription, MT)和“补充转录”(Supplementary Transcription, ST),分析两者在八个语种部分中的跨语言规律——MT主要处理与汉语音节结构兼容的发音,而ST专门编码不合于汉语音系的语音特征。这种分析表明,HHY并非将汉语音韵直接投射到其他语言,而是以更灵活的方式使用汉语音韵范畴,形成了一套相对系统的语音近似方法,从而为历史音系学研究,尤其是缺乏文献记录的亚洲语言,提供了重要证据。

链接: https://arxiv.org/abs/2605.14480
作者: Ji-eun Kim
机构: Duksung Women’s University (德成女子大学)
类目: Computation and Language (cs.CL)
备注: 47 pages; 1 figure; 40 tables; SLE2019; under review

点击查看摘要

Abstract:Purpose: This study investigates the transcription principles underlying Huìtóngguǎnxì Huáyíyìyǔ (HHY), a series of multilingual glossaries compiled by the Ming government between the fifteenth and sixteenth centuries for interpreter training. The study treats HHY not as a collection of isolated language materials, but as a coherent multilingual transcription system representing spoken forms of non-Chinese languages through Chinese characters. Methods: A substantial portion of HHY was digitized and aligned with Chinese phonological categories. Previous reconstructions of individual language sections were critically reviewed and integrated into a unified comparative database. The analysis focuses on cross-linguistic regularities in Main Transcription (MT) and Supplementary Transcription (ST) across eight language sections. Results: MT generally represents sounds compatible with the Chinese syllable structure of the period, whereas ST mainly encodes phonetic features less compatible with Chinese phonology. The analysis further shows that Chinese phonological categories were used more flexibly in foreign-language transcription than previously assumed. HHY therefore functioned as a relatively systematic method of phonetic approximation rather than a direct projection of Chinese phonology onto non-Chinese languages. Conclusion: HHY can be analyzed as an internally structured transcription system rather than merely as a collection of glossaries. More broadly, the study demonstrates that historical transcription systems can provide valuable evidence for historical phonology, particularly for under-documented Asian languages with limited historical records. 

[NLP-50] When Retrieval Hurts Code Completion: A Diagnostic Study of Stale Repository Context

【速读】: 该论文试图解决检索增强代码生成(Retrieval-Augmented Code Generation)中仓库上下文的时间一致性问题,即从过时项目状态中检索到的代码片段是仅作为无害噪声,还是会主动诱导模型生成与当前代码库状态不兼容的代码。解决方案的关键在于:通过一项受控诊断研究,在五个 Python 仓库的生产辅助签名变更样本集上,对比仅当前、仅过时、无检索以及混合当前与过时检索四种条件下模型的补全行为,并在提示中隐藏提交新鲜度和期望的当前签名以消除偏差。研究发现,过时检索显著提高了模型对过时辅助函数引用的生成比例(Qwen2.5-Coder-7B-Instruct 达 88.2%,gpt-4.1-mini 达 76.5%),而无检索虽产生零过时引用但补全通过率极低,混合条件则表明添加有效当前上下文能大部分挽救过时失败。因此,时间有效性被确立为代码检索增强生成(Code RAG)鲁棒性的一个重要诊断变量,提示过时上下文不仅移除有用证据,还会主动偏置模型向过时仓库状态倾斜。

链接: https://arxiv.org/abs/2605.14478
作者: Haojun Weng,Qianqian Yang,Hao Fu,Haobin Pan,Xinwei Lv
机构: Independent Researcher, California, USA (独立研究员,美国加利福尼亚); Independent Researcher, Beijing, China (独立研究员,中国北京)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 31 pages, 2 tables. Submitted to Information and Software Technology (Elsevier)

点击查看摘要

Abstract:Context: Retrieval-augmented code generation relies on cross-file repository context, but retrieved snippets may come from obsolete project states. Objectives: We study whether temporally stale repository snippets act as harmless noise or actively induce current-state-incompatible code. Methods: We conduct a controlled diagnostic study on a curated 17-sample set of production-helper signature changes from five Python repositories. For each sample, we compare current-only, stale-only, no-retrieval, and mixed current/stale retrieval conditions under prompts that hide commit freshness and expected current signatures. Results: Under neutralized prompts, stale-only retrieval induces stale helper references on 15/17 Qwen2.5-Coder-7B-Instruct samples and 13/17 gpt-4.1-mini samples, corresponding to 88.2 and 76.5 percentage-point increases over current-only retrieval. No retrieval produces zero stale references but only 1/17 passing completions. The two models share 75.0% Jaccard overlap among stale-triggering samples, and mixed conditions show that adding valid current evidence largely rescues stale-only failures. Conclusion: Temporal validity of retrieved repository context is a distinct diagnostic variable for Code RAG robustness: stale context can actively bias models toward obsolete repository state rather than merely removing useful evidence.

[NLP-51] Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

【速读】: 该论文旨在解决检索增强生成(RAG)中“上下文合规性”(Context-Compliance Regime)问题,即当检索到的上下文与模型参数化知识冲突时,检索到的上下文会主导最终答案,而标准准确率指标无法揭示这种冲突下检索上下文对答案的因果影响。解决方案的关键在于引入上下文驱动分解(Context-Driven Decomposition, CDD),这是一种在推理时运行的信念分解探测机制,通过显式分解模型内部对检索上下文与参数化知识的信念差异,实现对检索冲突的干预控制。CDD作为干预手段,能够在对抗性设置(如TruthfulQA误解注入)中显著提升准确率(从15.0%提升至64.1%的因果敏感性),并揭示不同模型家族(如Gemini vs. Claude)在冲突解决机制上的差异,同时增强模型在时间漂移和噪声干扰下的鲁棒性,从而将上下文合规性确立为一个独立于检索质量或单一方法鲁棒性的结构化维度。

链接: https://arxiv.org/abs/2605.14473
作者: Yihang Chen,Pin Qian,Su Wang,Sipeng Zhang,Huan Xu,Shuhuai Lin,Xinpeng Wei
机构: Georgia Institute of Technology (佐治亚理工学院); Carnegie Mellon University (卡内基梅隆大学); University of California San Diego (加利福尼亚大学圣迭戈分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, 3 tables

点击查看摘要

Abstract:The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model’s parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross- model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families: CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus, but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake- injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines. 

[NLP-52] LiSA: Lifelong Safety Adaptation via Conservative Policy Induction

【速读】: 该论文致力于解决AI代理(AI agents)从聊天界面转向读取私人数据、调用工具和执行多步骤工作流时,因上下文敏感的安全护栏失效所引发的实际部署危害问题——具体包括秘密泄露、未授权操作或合法工作受阻——而部署反馈通常稀疏、嘈杂,且难以依赖重复微调来适应。解决方案的关键在于提出终身安全适应(LiSA, Lifelong Safety Adaptation)这一保守的策略归纳框架,其核心是通过结构化记忆改进固定基础护栏:将偶发失败转化为可重用的策略抽象以使稀疏报告能泛化,引入冲突感知的局部规则以防止混合标签场景下的过泛化,并应用基于后验下界的证据感知置信度门控,使记忆重用随累积证据而非经验准确性动态缩放。

链接: https://arxiv.org/abs/2605.14454
作者: Minbeom Kim,Lesly Miculicich,Bhavana Dalvi Mishra,Mihir Parmar,Phillip Wallis,Bharath Chandrasekhar,Kyomin Jung,Tomas Pfister,Long T. Le
机构: Google Cloud AI Research (谷歌云AI研究); Google (谷歌); Seoul National University (首尔国立大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 27 pages, 3 figures

点击查看摘要

Abstract:As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency–performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge risks.
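摘要中提到的"基于后验下界的证据感知置信度门控",可以用一个极简示意来理解:下面用标准的 Wilson 得分下界演示为什么同样是 100% 的经验准确率,证据少时不应复用记忆规则。注意这只是对该思想的假设性示意(阈值、置信参数均为虚构),并非论文原实现。

```python
import math

def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Wilson 得分区间的下界:经验准确率相同,试次越少,下界越低。"""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return (center - margin) / denom

def should_reuse_rule(successes: int, trials: int, threshold: float = 0.7) -> bool:
    """只有当后验下界超过阈值时才复用该记忆规则(假设的门控策略)。"""
    return wilson_lower_bound(successes, trials) >= threshold

# 经验准确率同为 100%,但累积证据量不同,门控结论不同:
assert not should_reuse_rule(2, 2)    # 证据太少,下界约 0.34,暂不复用
assert should_reuse_rule(30, 30)      # 证据充分,下界约 0.89,允许复用
```

这正对应摘要中"记忆重用随累积证据而非经验准确性缩放"的直觉。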

[NLP-53] When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition

【速读】: 该论文旨在解决大语言模型(LLMs)幻觉检测中准确率、效率与分布漂移鲁棒性之间的权衡问题:黑盒一致性方法需多次推理,而单次白盒探测器虽高效但孤立处理答案表示,在领域迁移下性能严重下降。解决方案的关键在于提出QAOD(Question-Answer Orthogonal Decomposition)单次前向框架,通过将答案表示投影至与问题对齐方向正交的分量,从而抑制领域条件化的变异;进一步利用多样性惩罚Fisher评分选取有效层,以及Fisher重要性选取判别性神经元,并设计两种互补探测策略——联合探测(正交分量结合问题上下文)最大化域内判别能力,正交分量单独探测则保留领域无关的事实信号,实现跨域鲁棒迁移。

链接: https://arxiv.org/abs/2605.14449
作者: Siyang Yao,Erhu Feng,Yubin Xia
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation, often degrading sharply under domain shift. We propose QAOD (Question-Answer Orthogonal Decomposition), a single-pass framework that projects away the question-aligned direction from the answer representation to obtain a question-orthogonal component that suppresses domain-conditioned variation. To identify informative signals, QAOD further selects layers via diversity-penalized Fisher scoring and discriminative neurons via Fisher importance. To address both in-domain detection and cross-domain generalization, we design two complementary probing strategies: pairing the orthogonal component with question context yields a joint probe that maximizes in-domain discriminability, while using the orthogonal component alone preserves domain-agnostic factuality signals for robust transfer. QAOD’s joint probe achieves the best in-domain AUROC across all evaluated model-dataset pairs, while the orthogonal-only probe delivers the strongest OOD transfer, surpassing the best white-box baseline by up to 21% on BioASQ at under 25% of generation cost.
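QAOD 的核心投影操作——从答案表示中移除与问题方向对齐的分量——在数学上就是一次正交投影。下面是一个纯 Python 的极简示意,向量均为假设的隐藏状态,并非论文实现:

```python
def project_out(answer: list[float], question: list[float]) -> list[float]:
    """从答案表示中移除沿问题方向的分量,得到问题正交分量:
    h_orth = h_a - (h_a·q / q·q) * q"""
    q_norm_sq = sum(q * q for q in question)
    if q_norm_sq == 0:
        return list(answer)
    scale = sum(a * q for a, q in zip(answer, question)) / q_norm_sq
    return [a - scale * q for a, q in zip(answer, question)]

h_answer = [2.0, 1.0, 0.0]      # 假设的答案隐藏状态
h_question = [1.0, 0.0, 0.0]    # 假设的问题方向
h_orth = project_out(h_answer, h_question)
# 正交分量与问题方向的内积为 0,问题(领域)条件化的变异被消除
assert abs(sum(o * q for o, q in zip(h_orth, h_question))) < 1e-9
```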

[NLP-54] A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

【速读】: 该论文旨在解决端到端自动语音识别(ASR)系统中词汇表大小(vocabulary size)这一关键超参数缺乏系统确定方法的问题。现有工具包(如ESPNet)使用固定词汇大小,但未提供其确定依据,而传统方法依赖经验或黑箱式代价函数。解决方案的核心在于:基于现有代价函数框架,通过对训练数据进行曲线拟合,并应用微积分中的一阶和二阶导数检验(first and second derivative tests)来形式化地估计最优词汇表大小。该方法在Librispeech语料库上的实验表明,通过导数检验选出的词汇表大小能有效提升ASR系统性能,其主要贡献是建立了一种系统化的词汇表大小选择策略。

链接: https://arxiv.org/abs/2605.14427
作者: Sunil Kumar Kopparapu
机构: TCS Research - Mumbai
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 8 pages, is an extension of the paper S. K. Kopparapu and A. Panda, A cost minimization approach to fix the vocabulary size in a tokenizer for an end-to-end ASR system, in Proceedings of the 2024 International Conference on Pattern Recognition, Kolkata, India, 2024

点击查看摘要

Abstract:In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used for training. The choice and, more importantly, the size of this vocabulary is a critical hyper-parameter in training end-to-end ASR systems. Tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, and Unigram Language Model (ULM) use the vocabulary size as an input hyper-parameter to generate the sub-words employed during ASR training. Popular toolkits like ESPNet provide a fixed vocabulary size in their training recipes, but there is little documentation or discussion in the literature regarding how these values are determined. Recent work [1] has formalized an approach to identify the vocabulary size best suited for end-to-end ASR, introducing a cost function framework that treats the tokenization process as a black box. In this paper, we build upon that foundation by curve fitting the training data and using the principle of first and second derivative tests in calculus to formally estimate the vocabulary size hyper-parameter. We demonstrate the utility and usefulness of our approach by applying it on a standard Librispeech corpus and show that the optimal choice of vocabulary size hyper-parameter improves the performance of the ASR. The main contribution of this paper lies in formalizing an approach to identify the vocabulary size best suited for training an end-to-end ASR system.
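论文用曲线拟合加一阶/二阶导数检验来估计词汇表大小。下面给出一个示意:对假设的(词汇表大小, 代价)观测做二次最小二乘拟合,一阶导数零点给出驻点,二阶导数为正则确认其为最小值。代价数据为虚构,真实代价函数需按论文引用的 [1] 在目标语料上计算:

```python
def solve3(m, v):
    """克莱默法则解 3x3 线性方程组(纯 Python 实现)。"""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
              - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
              + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    out = []
    for i in range(3):
        mi = [row[:] for row in m]
        for r in range(3):
            mi[r][i] = v[r]
        out.append(det(mi) / d)
    return out

def optimal_vocab_size(sizes, costs):
    """二次拟合 cost ≈ a·s² + b·s + c(s 为缩放后的词汇表大小),
    一阶导数零点 2a·s + b = 0 给出驻点,二阶导数 2a > 0 确认最小值。"""
    xs = [s / 1000.0 for s in sizes]          # 缩放以改善数值条件
    sx = sum(xs); sx2 = sum(x ** 2 for x in xs)
    sx3 = sum(x ** 3 for x in xs); sx4 = sum(x ** 4 for x in xs)
    sy = sum(costs)
    sxy = sum(x * y for x, y in zip(xs, costs))
    sx2y = sum(x * x * y for x, y in zip(xs, costs))
    a, b, _c = solve3([[sx4, sx3, sx2], [sx3, sx2, sx], [sx2, sx, float(len(xs))]],
                      [sx2y, sxy, sy])
    assert 2 * a > 0, "二阶导数需为正,驻点才是最小值"
    return 1000.0 * (-b / (2 * a))

# 假设的(词汇表大小, 代价)观测,真实值需在目标语料上测得
sizes = [1000, 1500, 2000, 2500, 3000]
costs = [1005.0, 255.0, 5.0, 255.0, 1005.0]
v_star = optimal_vocab_size(sizes, costs)
assert abs(v_star - 2000) < 1.0
```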

[NLP-55] SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades

【速读】: 该论文试图解决当前缺乏评估大型语言模型(LLM)驱动的代码智能体(coding agent)在连续包发布(package release)链式升级任务中表现的基准问题。现有基准侧重于孤立的单次问题修复,未能捕捉现实软件维护中变更被捆绑、发布并由后续版本继承的连续性特性。解决方案的关键在于设计了一个分而治之的合成流水线(divide-and-conquer synthesis pipeline),该流水线将每个版本过渡的发布说明(release notes)与代码差异(code diffs)对齐,自动生成基于实际代码变更、对智能体信息丰富且可实现的升级需求规范(upgrade requirements),从而构建了包含12条升级链、155个版本过渡和1,660个接地升级需求的SWE-Chain基准。

链接: https://arxiv.org/abs/2605.14415
作者: Man Ho Lam,Chaozheng Wang,Hange Liu,Jingyu Xiao,Haau-sing Li,Jen-tse Huang,Terry Yue Zhuo,Michael R. Lyu
机构: The Chinese University of Hong Kong (香港中文大学); Independent (独立研究机构); ELLIS (欧洲学习与智能系统实验室); Technical University of Darmstadt (达姆施塔特工业大学); Johns Hopkins University (约翰霍普金斯大学); Monash University (莫纳什大学)
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of package releases, where changes are bundled, shipped, and inherited by subsequent versions. We present SWE-Chain, a benchmark for evaluating agents on chained release-level package upgrades, where each transition builds on the agent’s prior codebase. To produce upgrade specifications, we design a divide-and-conquer synthesis pipeline that aligns release notes with code diffs for each version transition, ensuring the requirements are grounded in actual code changes, informative to agents, and feasible to implement. SWE-Chain contains 12 upgrade chains across 9 real Python packages, with 155 version transitions and 1,660 grounded upgrade requirements. Across nine frontier agent-model configurations, agents achieve an average of 44.8% resolving, 65.4% precision, and 50.2% F1 under the Build+Fix regime, with Claude-Opus-4.7 (Claude Code) leading at 60.8% resolving, 80.6% precision, and 68.5% F1. These results show that SWE-Chain is both feasible and discriminative, and reveal that current agents still struggle to make correct upgrades across chained package releases without breaking existing functionality.

[NLP-56] Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

【速读】: 该论文试图解决在多语言大型语言模型(LLM)中,现有机器遗忘(Machine Unlearning)评估方法无法有效捕捉跨语言信息分布的问题,即先前的评估仅是对单语言评估协议的简单扩展,忽略了信息在多语言间的渗透与残留。解决方案的关键在于提出了两个新指标:知识可分离性得分(Knowledge Separability Score, KSS)和知识持久性得分(Knowledge Persistence Score, KPS),分别从整体遗忘质量和跨语言对一致性移除两个维度来衡量多语言机器遗忘(MMU)效果,从而为MMU提供更准确的评估视角并揭示其特有现象。

链接: https://arxiv.org/abs/2605.14404
作者: Kyomin Hwang,Hyeonjin Kim,Sangyeon Cho,Nojun Kwak
机构: GSCST, Seoul National University (首尔大学GSCST); AIIS, Seoul National University (首尔大学AIIS); Department of Artificial Intelligence, Chung-Ang University (中央大学人工智能系); Korean Surgical Researcher Foundation, Republic of Korea (大韩民国韩国外科研究员基金会)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prior MMU evaluations fail to capture such cross-linguistic distribution of information, being largely limited to direct extensions of per-language evaluation protocols. To this end, we propose two metrics to evaluate the information spread across languages: the Knowledge Separability Score (KSS) and the Knowledge Persistence Score (KPS). KSS measures the overall unlearning quality across multiple languages, while KPS more specifically aims to assess consistent removal of information among different language pairs. We evaluated various unlearning methods in the multilingual setting with these metrics and conducted comprehensive analyses. Through our investigation, we provide insights into unique phenomena exclusive to MMU and offer a new perspective on MMU evaluation.
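摘要未给出 KSS/KPS 的具体公式,下面仅以一个假设性的示意刻画 KPS"跨语言对一致性"的直觉:各语言的遗忘程度两两差异越小,得分越高。函数与数值均为虚构,并非论文原定义:

```python
from itertools import combinations

def kps_like(forget_scores: dict) -> float:
    """示意版"跨语言一致性"度量:对所有语言对取遗忘程度差的平均,
    再取补。forget_scores 为每种语言的(假设)遗忘程度,1 表示完全遗忘。"""
    pairs = list(combinations(forget_scores.values(), 2))
    if not pairs:
        return 1.0
    return 1.0 - sum(abs(a - b) for a, b in pairs) / len(pairs)

# 两个系统的整体遗忘程度相同(均值 0.6),但跨语言一致性不同:
uniform = {"en": 0.6, "de": 0.6, "ko": 0.6}
skewed  = {"en": 1.0, "de": 0.6, "ko": 0.2}
assert kps_like(uniform) > kps_like(skewed)
```

这说明了为什么仅看逐语言平均指标会掩盖信息在部分语言中残留的情况。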

[NLP-57] Agentic Recommender System with Hierarchical Belief-State Memory

【速读】: 该论文试图解决现有记忆增强型大语言模型(LLM)代理在个性化推荐中采用扁平记忆表示(flat memory representation)而导致的短暂信号与稳定偏好的混淆问题,以及缺乏完整记忆生命周期演化机制的问题。解决方案的关键在于提出MARS(Memory-Augmented Agentic Recommender System)框架,将推荐任务视为部分可观察问题(partially observable problem),通过维护结构化信念状态(structured belief state)来逐步将噪声化的观察行为抽象为用户偏好的紧凑估计。MARS将信念状态组织为三层记忆结构:事件记忆(event memory)缓冲原始信号,偏好记忆(preference memory)维护带有显式强度和证据追踪的细粒度可变动词块,档案记忆(profile memory)将所有偏好蒸馏为连贯的自然语言叙事。同时,设计了一套由基于LLM的计划器自适应调度的完整生命周期,包含提取(extraction)、强化(reinforcement)、弱化(weakening)、合并(consolidation)、遗忘(forgetting)和再合成(resynthesis)六种操作,替代了传统的固定间隔启发式调度。实验证明,该方法在InstructRec基准的四个领域上,相较最强基线在HR@1和NDCG@10上分别平均提升26.4%和10.3%,且在动态演化场景中进一步提升了代理调度带来的收益。

链接: https://arxiv.org/abs/2605.14401
作者: Xiang Shen,Yuhang Zhou,Yifan Wu,Zhuokai Zhao,Siyu Lin,Lei Huang,Qianqian Zhong,Lizhu Zhang,Benyu Zhang,Xiangjun Fan,Hong Yan
机构: Meta
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 4 figures, 8 tables

点击查看摘要

Abstract:Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Augmented Agentic Recommender System), a framework that treats recommendation as a partially observable problem and maintains a structured belief state that progressively abstracts noisy behavioral observations into a compact estimate of user preferences. MARS organizes this belief state into three tiers: event memory buffers raw signals, preference memory maintains fine-grained mutable chunks with explicit strength and evidence tracking, and profile memory distills all preferences into a coherent natural language narrative. A complete lifecycle of six operations – extraction, reinforcement, weakening, consolidation, forgetting, and resynthesis – is adaptively scheduled by an LLM-based planner rather than fixed-interval heuristics. Experiments on four InstructRec benchmark domains show that MARS achieves state-of-the-art performance with average improvements of 26.4% in HR@1 and 10.3% in NDCG@10 over the strongest baselines with further gains from agentic scheduling in evolving settings.
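MARS 的三层记忆中,偏好记忆维护"带显式强度与证据追踪的可变块",并支持强化、弱化、遗忘等生命周期操作。下面是一个假设性的数据结构示意(字段、步进与阈值均为虚构;论文中这些操作由 LLM 计划器自适应调度,而非固定规则):

```python
from dataclasses import dataclass, field

@dataclass
class PreferenceChunk:
    """偏好记忆条目:显式强度 + 证据追踪(字段设计为假设,非论文原实现)。"""
    text: str
    strength: float = 0.5
    evidence: list = field(default_factory=list)

    def reinforce(self, event: str, step: float = 0.2) -> None:
        self.evidence.append(event)
        self.strength = min(1.0, self.strength + step)

    def weaken(self, step: float = 0.2) -> None:
        self.strength = max(0.0, self.strength - step)

class PreferenceMemory:
    """三层信念状态的中间层;强度低于阈值的条目在"遗忘"操作中被清除。"""
    def __init__(self, forget_threshold: float = 0.4):
        self.chunks: dict[str, PreferenceChunk] = {}
        self.forget_threshold = forget_threshold

    def update(self, key: str, event: str, positive: bool) -> None:
        chunk = self.chunks.setdefault(key, PreferenceChunk(text=key))
        if positive:
            chunk.reinforce(event)
        else:
            chunk.weaken()

    def forget(self) -> None:
        self.chunks = {k: c for k, c in self.chunks.items()
                       if c.strength > self.forget_threshold}

mem = PreferenceMemory(forget_threshold=0.4)
mem.update("喜欢科幻电影", "点击《沙丘》", positive=True)     # 强度 0.5 -> 0.7
mem.update("喜欢科幻电影", "收藏《三体》", positive=True)     # 强度 0.7 -> 0.9
mem.update("喜欢恐怖片", "跳过恐怖片推荐", positive=False)    # 强度 0.5 -> 0.3
mem.forget()
assert "喜欢科幻电影" in mem.chunks and "喜欢恐怖片" not in mem.chunks
```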

[NLP-58] Nexus: An Agentic Framework for Time Series Forecasting

【速读】: 该论文试图解决现实世界时间序列预测中缺乏对非结构化上下文信息(如新闻或事件)有效利用的问题,并克服专业时间序列基础模型(Time Series Foundation Models, TSFMs)仅依赖数值模式而忽视文本信号,以及大语言模型(LLMs)作为零样本预测器时性能不稳定的局限性。解决方案的关键是提出了一个名为Nexus的多智能体(multi-agent)预测框架,它将预测任务分解为多个专门阶段:分别隔离宏观与微观时间波动,在可用时整合上下文信息,最终合成预测结果。这种分解使得模型能够从季节性信号自适应地过渡到波动性、事件驱动的信息,而无需依赖外部统计锚点或单一提示,从而有效组织数值推理与语境推理,释放LLMs内在的强预测能力,并生成带有明确驱动因素的高质量推理轨迹。

链接: https://arxiv.org/abs/2605.14389
作者: Sarkar Snigdha Sarathi Das,Palash Goyal,Mihir Parmar,Nanyun Peng,Vishy Tirumalashetty,Chun-Liang Li,Rui Zhang,Jinsung Yoon,Tomas Pfister
机构: Google(谷歌); Pennsylvania State University(宾夕法尼亚州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 30 Pages, 3 figures, 5 Tables

点击查看摘要

Abstract:Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual signals. Conversely, while LLMs are emerging as zero-shot forecasters, their performance remains uneven across domains and contextual grounding. To bridge this gap, we introduce Nexus, a multi-agent forecasting framework that decomposes prediction into specialized stages: isolating macro-level and micro-level temporal fluctuations, and integrating contextual information when available before synthesizing a final forecast. This decomposition enables Nexus to adapt from seasonal signals to volatile, event-driven information without relying on external statistical anchors or monolithic prompting. We show that current-generation LLMs possess substantially stronger intrinsic forecasting ability than previously recognized, depending critically on how numerical and contextual reasoning are organized. Evaluated on data strictly succeeding LLM knowledge cutoffs spanning Zillow real estate metrics and volatile stock market equities, Nexus consistently matches or outperforms state-of-the-art TSFMs and strong LLM baselines. Beyond numerical accuracy, Nexus produces high-quality reasoning traces that explicitly show the fundamental drivers behind each forecast. Our results establish that real-world forecasting is an agentic reasoning problem extending well beyond only sequence modeling.
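Nexus 把预测分解为宏观与微观波动两个层面。下面用"滑动平均 = 宏观趋势、残差 = 微观波动"给出一个假设性的极简示意;论文中的分解由 LLM 智能体完成并会融合上下文信息,这里仅演示分解-还原的结构:

```python
def decompose(series: list[float], window: int = 3):
    """把序列拆为宏观趋势(居中滑动平均)与微观波动(残差)。
    边界处窗口自动截断,因此 macro 与原序列等长。"""
    half = window // 2
    macro = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        macro.append(sum(series[lo:hi]) / (hi - lo))
    micro = [x - m for x, m in zip(series, macro)]
    return macro, micro

series = [10.0, 12.0, 11.0, 13.0, 12.0]   # 假设的观测序列
macro, micro = decompose(series)
# 趋势与残差相加应精确还原原序列
assert all(abs(m + r - x) < 1e-9 for m, r, x in zip(macro, micro, series))
```

两个分量可分别交给专门的智能体处理,再合成最终预测。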

[NLP-59] NodeSynth: Socially Aligned Synthetic Data for AI Evaluation

【速读】: 该论文旨在解决生成式AI(Generative AI)模型评估中合成数据缺乏敏感领域所需的社会技术细微差别(sociotechnical nuance)的问题,即大规模合成查询(queries)无法有效揭示模型在现实高风险场景下的缺陷。解决方案的关键是NodeSynth,一种基于真实世界证据(evidence-grounded)的方法论,其核心是引入一个经过微调的分类生成器(Taxonomy Generator, TaG),该生成器锚定于真实世界证据,能够生成具有社会相关性的合成查询;通过细粒度的分类扩展(granular taxonomic expansion),NodeSynth显著提升了模型失败率,并在主流大语言模型(LLMs)和防护模型(guard models)上验证了其有效性。

链接: https://arxiv.org/abs/2605.14381
作者: Qazi Mamunur Rashid,Xuan Yang,Zhengzhe Yang,Yanzhou Pan,Erin van Liemt,Darlene Neal,Kshitij Pancholi,Jamila Smith-Loud
机构: Google Research(谷歌研究); Google(谷歌)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generates socially relevant synthetic queries by leveraging a fine-tuned taxonomy generator (TaG) anchored in real-world evidence. Evaluated against four mainstream LLMs (e.g., Claude 4.5 Haiku), NodeSynth elicited failure rates up to five times higher than human-authored benchmarks. Ablation studies confirm that our granular taxonomic expansion significantly drives these failure rates, while independent validation reveals critical deficiencies in prominent guard models (e.g., Llama-Guard-3). We open-source our end-to-end research prototype and datasets to enable scalable, high-stakes model evaluation and targeted safety interventions (this https URL).

[NLP-60] Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation

【速读】: 该论文旨在解决心理防御机制(Psychological Defense Mechanisms, PDMs)自动文本分类中因数据稀缺(data scarcity)和类别不平衡(class imbalance)导致的性能瓶颈问题,并指出单纯依赖生成式增强(generative augmentation)因缺乏心理学基础而效果有限。解决方案的关键在于提出一个上下文感知的合成增强框架(context-aware synthetic augmentation framework),并将其与一个混合分类模型(hybrid classification model)相结合;该混合模型整合了上下文语言表示(contextual language representations)、基本临床特征(basic clinical features)以及150个标注防御项(annotated defense items),并通过控制提示(prompting)中的定义质量来确保生成保真度和下游性能,最终在低资源场景(low-resource settings)下为心理防御机制分类建立了强基线。

链接: https://arxiv.org/abs/2605.14380
作者: Hoang-Thuy-Duong Vu,Quoc-Cuong Pham,Huy-Hieu Pham
机构: College of Engineering and Computer Science, VinUniversity, Hanoi, Vietnam (越南河内Vin大学工程与计算机科学学院); VinUni-Illinois Smart Health Center, VinUniversity, Hanoi, Vietnam (越南河内Vin大学VinUni-Illinois智能健康中心); Center for Innovations in Health Sciences, VinUniversity, Hanoi, Vietnam (越南河内Vin大学健康科学创新中心)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: this https URL.

[NLP-61] Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement

【速读】: 该论文试图解决连续扩散语言模型(continuous diffusion language models)在语言去噪和token恢复任务中表现落后于自回归Transformer的问题,其核心原因在于扩散过程应用在不利于语言离散特性恢复的连续空间。解决方案的关键在于:利用基于几何评分(geometry score)的代理指标,为预训练Transformer选择一个扩散友好的隐藏状态接口(hidden-state interface),用扩散桥(diffusion bridge)替换下层Transformer前缀,同时保留上层与原始语言模型头(LM head);通过直接重构选定层的隐藏状态而非token,避免了从连续表示到离散token的直接恢复。这种方法使得扩散替代位置的预测更为有效,并在匹配训练预算的诊断比较中提升了隐藏状态恢复性能。

链接: https://arxiv.org/abs/2605.14368
作者: Injin Kong,Hyoungjoon Lee,Yohan Jo
机构: Graduate School of Data Science, Seoul National University (数据科学研究生院, 首尔大学); Department of Biosystems & Biomaterials Science and Engineering, Seoul National University (生物系统与生物材料科学与工程系, 首尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained transformer. DiHAL scores layers with geometry-based proxies, selects a diffusion-friendly hidden-state interface, and replaces the lower transformer prefix with a diffusion bridge while retaining the upper layers and original LM head. By reconstructing the selected-layer hidden state rather than tokens, DiHAL avoids direct continuous-to-discrete recovery. Experiments on 8B-scale backbones show that the geometry score predicts effective shallow insertion layers under a fixed bridge-training protocol and that hidden-state recovery improves over continuous diffusion baselines in a diagnostic comparison matching the diffusion/recovery training budget. These results suggest that hidden-state geometry helps identify where diffusion-based replacement is feasible inside pretrained language models.

[NLP-62] Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax ACL2026

【速读】: 该论文试图解决将大语言模型扩展到低资源语言时存在的“对齐税”(alignment tax)问题,即目标语言能力提升往往以通用能力的灾难性遗忘为代价。其解决方案的关键在于提出一种基于语义空间的强化学习范式,具体通过组相对策略优化(Group Relative Policy Optimization, GRPO)对模型进行嵌入层级的语义奖励优化,替代传统的监督微调(Supervised Fine-Tuning, SFT)中基于token级别表面模仿的似然最大化目标。这一方法通过鼓励语义保持和灵活的实现方式,实现了对模型参数的受控更新,从而显著减少对预训练知识的破坏性干扰,在获取低资源语言能力的同时有效缓解对齐税。

链接: https://arxiv.org/abs/2605.14366
作者: Zeli Su,Ziyin Zhang,Zhou Liu,Xuexian Song,Zhankai Xu,Longfei Zheng,Xiaolu Zhang,Rong Fu,Guixian Xu,Wentao Zhang
机构: Minzu University of China(中央民族大学); Ant Group(蚂蚁集团); Shanghai Jiao Tong University(上海交通大学); University of Macau(澳门大学); Peking University(北京大学); Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所); Hainan International College, Minzu University of China(中央民族大学海南国际学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: ACL 2026 Findings

点击查看摘要

Abstract:Extending large language models (LLMs) to low-resource languages often incurs an “alignment tax”: improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enforces token-level surface imitation on narrow and biased data distributions. To address this limitation, we propose a semantic-space alignment paradigm powered by Group Relative Policy Optimization (GRPO), where the model is optimized using embedding-level semantic rewards rather than likelihood maximization. This objective encourages meaning preservation through flexible realizations, enabling controlled updates that reduce destructive interference with pretrained knowledge. We evaluate our approach on Tibetan-Chinese machine translation and Tibetan headline generation. Experiments show that our method acquires low-resource capabilities while markedly mitigating alignment tax, preserving general competence more effectively than SFT. Despite producing less rigid surface overlap, semantic RL yields higher semantic quality and preference in open-ended generation, and few-shot transfer results indicate that it learns more transferable and robust representations under limited supervision. Overall, our study demonstrates that reinforcement learning with semantic rewards provides a safer and more reliable pathway for inclusive low-resource language expansion.
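该方法用嵌入层级的语义奖励替代 token 级似然,并按 GRPO 的组内相对归一化计算优势。下面是一个极简示意:奖励取候选与参考嵌入的余弦相似度,优势为组内均值-标准差归一化后的奖励。嵌入向量为假设示例,实际应来自句向量模型:

```python
import math

def cosine(u, v):
    """余弦相似度,作为嵌入层级的语义奖励。"""
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv)

def group_relative_advantages(cand_embs, ref_emb):
    """GRPO 式组内归一化:优势 = (奖励 - 组内均值) / 组内标准差。"""
    rewards = [cosine(e, ref_emb) for e in cand_embs]
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mu) / sigma for r in rewards]

ref = [1.0, 0.0]                                  # 参考译文的(假设)嵌入
cands = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]     # 一组候选输出的嵌入
adv = group_relative_advantages(cands, ref)
assert max(adv) == adv[2] and adv[1] < 0          # 语义最近者优势最高,偏离者为负
```

由于奖励只看语义相近而非表面 token 重合,表述灵活但意义保持的候选同样能获得正优势。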

[NLP-63] Herculean: An Agentic Benchmark for Financial Intelligence

【速读】: 该论文旨在解决当前金融领域AI智能体评估中存在的核心问题:现有基准主要测试静态能力(如问答、检索、摘要、分类),无法衡量智能体能否可靠地执行完整的金融专业工作流(如交易、对冲、市场洞察、审计),而这些高 stakes 工作流要求长周期协调、状态一致性和结构化验证。解决方案的关键在于设计了第一个专门针对智能体金融智能的技能基准Herculean,它将四个代表性工作流实例化为基于MCP(Model Context Protocol)的标准化技能环境,每个环境配备专用工具、交互动态、约束和成功标准,从而实现对异构智能体系统的端到端一致性评估,并揭示了当前智能体在将金融推理转化为可靠工作流执行方面的显著差距。

链接: https://arxiv.org/abs/2605.14355
作者: Xueqing Peng,Zhuohan Xie,Yupeng Cao,Haohang Li,Lingfei Qian,Yan Wang,Vincent Jim Zhang,Huan He,Xuguang Ai,Linhai Ma,Ruoyu Xiang,Yueru He,Yi Han,Shuyao Wang,Yuqing Guo,Mingyang Jiang,Yilun Zhao,Youzhong Dong,Xiaoyu Wang,Yankai Chen,Ye Yuan,Qiyuan Zhang,Fuyuan Lyu,Haolun Wu,Yonghan Yang,Zichen Zhao,Yuyang Dai,Fan Zhang,Rania Elbadry,Ayesha Gull,Muhammad Usman Safder,Nuo Chen,Fengbin Zhu,Tianshi Cai,Zimu Wang,Polydoros Giannouris,Yuechen Jiang,Zhiwei Liu,Mohsinul Kabir,Yuyan Wang,Yixiang Zheng,Yangyang Yu,Weijin Liu,Wenbo Cao,Anke Xu,Peng Lu,Jerry Huang,Fengran Mo,Mingquan Lin,Prayag Tiwari,Yijia Zhao,Victor Gutierrez Basulto,Xiao-Yang Liu,Kaleb E Smith,Jiahuan Pei,Arman Cohan,Jimin Huang,Yuehua Tang,Alejandro Lopez-Lira,Xi Chen,Xue Liu,Junichi Tsujii,Jian-Yun Nie,Sophia Ananiadou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static competencies such as question answering, retrieval, summarization, and classification. We introduce Herculean, the first skill-based benchmark for agentic financial intelligence spanning four representative workflows, including Trading, Hedging, Market Insights, and Auditing. Each workflow is instantiated as a standardized MCP-based skill environment with its own tools, interaction dynamics, constraints, and success criteria, enabling consistent end-to-end assessment of heterogeneous agent systems. Across frontier agents, we find agents perform relatively well on Trading and Market Insights, but struggle substantially on Hedging and Auditing, where long-horizon coordination, state consistency, and structured verification are critical. Overall, our results point to a key gap in current agents in turning financial reasoning into dependable workflow execution in high-stakes financial workflows.

[NLP-64] LLM-based Detection of Manipulative Political Narratives

【速读】: 该论文旨在解决社交媒体中操纵性政治叙事的自动检测与结构化难题,核心挑战在于区分操纵性叙事与合法批评,以及识别那些将真实事件重新框架化(reframe)以服务于操纵意图的帖子。解决方案的关键在于构建一个结合小样本提示(few-shot prompt)过滤与无监督聚类的混合框架:首先,通过融合已知运动叙事与合法批评的详细提示,驱动推理模型(reasoning model)对帖子进行标签分类,仅保留被判定为操纵性叙事的帖子;然后,对这些选中的帖子进行嵌入(embedding)与UMAP降维,再应用HDBSCAN算法进行无监督聚类,以发现不依赖预定义类别的、新的叙事簇;最后,再次利用推理模型解释每个簇背后的叙事逻辑。该方法在120万条社交媒体帖子上成功识别出41个不同的操纵性叙事簇,验证了提示过滤与无监督聚类集成方案的有效性。

链接: https://arxiv.org/abs/2605.14354
作者: Sinclair Schneider,Florian Steuber,Gabi Dreo Rodosek
机构: 未知
类目: Computation and Language (cs.CL)
备注: This paper has been submitted to the upcoming 18th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2026)

点击查看摘要

Abstract:We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important due to the shift of political discussions to social media. One of the primary challenges is differentiating between manipulative political narratives and legitimate critiques. Some posts may also reframe actual events within a manipulative context. To achieve good clustering results, we filter manipulative posts beforehand using a detailed few-shot prompt that combines documented campaign narratives with legitimate criticisms to differentiate them. This prompt enables a reasoning model to assign labels, retaining only manipulative narrative posts for further processing. The remaining posts are subsequently embedded and dimensionality-reduced using UMAP, before HDBSCAN is applied to uncover narrative groups. A key advantage of this unsupervised approach is its independence from a predefined list of target categories, enabling it to uncover new narrative clusters. Finally, a reasoning model is employed to uncover the narrative behind each cluster. This approach, applied to over 1.2 million social media posts, effectively identified 41 distinct manipulative narrative clusters by integrating prompt-based filtering with unsupervised clustering. Comments: This paper has been submitted to the upcoming 18th International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2026) Subjects: Computation and Language (cs.CL) Cite as: arXiv:2605.14354 [cs.CL] https://doi.org/10.48550/arXiv.2605.14354 Submission history From: Sinclair Schneider [v1] Thu, 14 May 2026 04:30:21 UTC (98 KB)
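整条流水线为:少样本提示过滤 → 嵌入 → UMAP 降维 → HDBSCAN 聚类 → 叙事解释。下面是一个结构性示意,其中分类器、降维坐标与聚类器均为占位实现(实际应分别调用 LLM、umap-learn 与 hdbscan 库):

```python
def filter_manipulative(posts, classify):
    """仅保留被分类器(实际为带少样本提示的推理模型)判为操纵性叙事的帖子。"""
    return [p for p in posts if classify(p)]

def cluster_by_threshold(points, eps=1.0):
    """HDBSCAN 的极简占位:按一维距离对降维后的坐标贪心分组。"""
    labels, centers = [], []
    for x in points:
        for i, c in enumerate(centers):
            if abs(x - c) <= eps:
                labels.append(i)
                break
        else:
            centers.append(x)
            labels.append(len(centers) - 1)
    return labels

posts = ["narrative A v1", "narrative A v2", "legit critique", "narrative B"]
kept = filter_manipulative(posts, classify=lambda p: "narrative" in p)
points = [0.1, 0.3, 5.0]              # 假设 kept 三条帖子降维后的一维坐标
labels = cluster_by_threshold(points)
assert kept == ["narrative A v1", "narrative A v2", "narrative B"]
assert labels == [0, 0, 1]            # 前两条聚为同一叙事簇
```

先过滤再聚类的好处是,聚类只需在操纵性帖子子集上区分叙事,而不必同时分离合法批评。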

[NLP-65] Ideology Prediction of German Political Texts AAAI

【速读】: 该论文旨在解决如何将政治文本的倾向性投射到一个连续的左-右光谱(由归一化标量 d 表示,范围 -1 到 1)上,以超越传统多类分类器只能输出离散标签的限制,从而允许分析师灵活聚焦于特定政治谱系段落(例如排除自由主义或极右翼运动)。解决方案的关键在于构建一个基于 transformer 的模型架构,并为其准备多个领域特定的训练与测试语料库(包括德国联邦议院会议记录、Wahl-O-Mat 决策工具、33 种报纸文本以及 597 名议员的 535,200 条推文),通过双语料训练与双语料测试的交叉验证策略防止过拟合,最终发现模型架构和领域特定训练数据的可获得性与模型规模对政治偏见估计的影响同等重要。

链接: https://arxiv.org/abs/2605.14352
作者: Sinclair Schneider,Florian Steuber,Joao A. G. Schneider,Gabi Dreo Rodosek
机构: 未知
类目: Computation and Language (cs.CL)
备注: This paper has been accepted for the upcoming 20th International AAAI Conference on Web and Social Media (ICWSM 2026)

点击查看摘要

Abstract:Elections represent a crucial milestone in a nation’s ongoing development. To better understand the political rhetoric from various movements, ranging from left to right, we propose a transformer-based model capable of projecting the political orientation of a text on a continuous left-to-right spectrum, represented by a normalized scalar d between -1 and 1. This approach enables analysts to focus on specific segments of the political landscape, such as conservatives, while excluding liberal and far-right movements. Such a task can only be achieved with multiclass classifiers, provided that the desired orientation is incorporated within one of their predefined classes. To determine the most suitable foundation model among 13 candidate transformers for this task, we constructed four distinct corpora. One corpus comprised annotated plenary notes from the German Bundestag, while another was based on an official online decision-making tool, Wahl-O-Mat. The third corpus consisted of articles from 33 newspapers, each identified by its political orientation, and the fourth included 535,200 tweets from 597 members of the 20th and 21st German Bundestag. To mitigate overfitting, we used two distinct corpora for training and two for testing, respectively. For in-domain performance, DeBERTa-large achieved the highest F1 score F1=0.844 as well as for the X (Twitter) out-of-domain test ACC=0.864. Regarding the newspaper out-of-domain test, Gemma2-2B excelled (MAE = 0.172). This study demonstrates that transformer models can recognize political framing in German news at the level of public opinion polls. Our findings suggest that both the model architecture and the availability of domain-specific training data can be as influential as model size for estimating political bias. We discuss methodological limitations and outline directions for improving the robustness of bias measurement.
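把政治倾向投射到连续标量 d ∈ [-1, 1],与输出离散类别的多类分类器的区别,可以用"按类别概率对光谱位置取期望"来直观说明。以下位置赋值为虚构,论文实际如何构造 d 的训练目标需以原文为准:

```python
# 将离散光谱类别映射为连续标量 d ∈ [-1, 1] 的示意;位置赋值为假设。
POSITIONS = {"far_left": -1.0, "left": -0.5, "center": 0.0,
             "right": 0.5, "far_right": 1.0}

def expected_orientation(probs: dict) -> float:
    """按类别概率对光谱位置取期望,得到连续的左右倾向得分 d。"""
    total = sum(probs.values())
    return sum(POSITIONS[k] * p for k, p in probs.items()) / total

d = expected_orientation({"left": 0.6, "center": 0.3, "right": 0.1})
assert -1.0 <= d <= 1.0 and d < 0     # 整体偏左,且落在连续区间内
```

连续得分使分析者可以直接按阈值截取光谱的某一段(例如只保留 d 在某区间内的文本),这正是摘要中"排除自由主义或极右翼运动"用法的基础。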

[NLP-66] Dynamic Latent Routing

【速读】: 该论文主要解决在低数据微调场景下,如何通过时间依赖的子策略组合来学习全局最优策略的问题,并针对语言模型后训练提出一种无需额外监督信号的高效离散隐含结构与路由策略学习方法。解决方案的关键在于:基于General Dijkstra Search (GDS)的“search, select, update”原则,提出的Dynamic Latent Routing (DLR)方法,能够在单一训练阶段中联合学习离散潜在编码、路由策略和模型参数,通过动态搜索实现最优子策略的时序组合,从而在低资源微调中匹配或超越传统监督微调(SFT)的表现,平均提升6.6个百分点。

Link: https://arxiv.org/abs/2605.14323
Authors: Fangyuan Yu, Xin Su, Amir Abdullah
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDP) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal sub-policies. Motivated by the “search, select, update” principle underlying GDS, we propose Dynamic Latent Routing (DLR), a language-model post-training method that jointly learns discrete latent codes, routing policies, and model parameters through dynamic search in a single training stage. In low-data fine-tuning settings, DLR matches or outperforms supervised fine-tuning across four datasets and six models, achieving a mean gain of +6.6 percentage points, while prior discrete-latent baselines consistently underperform SFT. Mechanistic analyses and targeted code ablations show that DLR learns structured routing behaviors with distinct causal roles.

[NLP-67] Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

【Quick Read】: This paper targets the factorization errors introduced when standard discrete diffusion language models approximate the clean token posterior with independent token-wise predictions, which breaks inter-token dependencies and degrades generation quality. The key to the solution is Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent prediction with an exact prefix-conditioned factorization to better preserve token dependencies. To offset the sequential cost of prefix conditioning, FeF-DLLM incorporates speculative decoding into the diffusion denoising process, accelerating inference while retaining the parallel prediction and re-masking properties of DLLMs. The authors prove the method generates from the true joint distribution and derive its expected acceleration ratio; experiments show an average accuracy gain of 5.04 percentage points and a 3.86× inference speedup across several benchmarks.

Link: https://arxiv.org/abs/2605.14305
Authors: Xun Fang, Yunchen Li, Hang Yuan, Zhou Yu
Affiliation: East China Normal University; Beijing Zhongguancun Academy
Subjects: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard X_0 prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free Discrete Diffusion Language Modeling (FeF-DLLM), which replaces independent clean-token prediction with an exact prefix-conditioned factorization of the clean posterior to better preserve token dependencies. To reduce the sequential cost introduced by prefix conditioning, FeF-DLLM further incorporates speculative decoding within diffusion denoising, accelerating inference while maintaining the parallel prediction and re-masking properties of DLLMs. Theoretically, we prove that FeF-DLLM generates from the true joint distribution and derive its expected acceleration ratio. Experiments on GSM8K, MATH, HumanEval, and MBPP demonstrate that our method improves accuracy by an average of 5.04 percentage points while achieving an average inference speedup of 3.86×.
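The speculative-verification idea can be sketched in miniature: a parallel draft is checked against an exact prefix-conditioned predictor, and the accepted tokens are exactly those the sequential model would have produced. The toy `exact_model` rule below is a hypothetical stand-in, not the paper's model:

```python
def exact_model(prefix):
    # Stand-in for exact prefix-conditioned next-token prediction
    # (a toy deterministic rule; a real model returns a distribution).
    return len(prefix) % 3

def verify_draft(prefix, draft):
    """Accept draft tokens while they match the exact model; on the first
    mismatch, emit the exact model's token instead and stop."""
    accepted = []
    for tok in draft:
        target = exact_model(prefix + accepted)
        if tok == target:
            accepted.append(tok)      # token verified "for free" in parallel
        else:
            accepted.append(target)   # correction from the exact model
            break
    return accepted
```

When the draft agrees often, several tokens are finalized per verification step, which is the source of the speedup while the output still follows the exact factorization.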

[NLP-68] Minimal-Intervention KV Retention: A Design-Space Study and a Diversity-Penalty Survivor

【Quick Read】: This paper studies the design space of KV-cache compression at small budgets: seven mechanisms across five families (cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring) are all found ineffective. The key to the solution is α, a one-function modification to the TriAttention retention scorer that replaces argmax-top-k with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight λ. A pre-registered protocol tunes λ on a frozen development split and confirms on a disjoint held-out split; with λ = 0.5, α clears Bonferroni correction in two of the four (model, budget) cells (Qwen b=128 and Llama b=64), with no cell significantly negative. The asymmetry of this finding shows that a minimal scoring modification beat heavier structural redesigns in this regime, and that the matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.

Link: https://arxiv.org/abs/2605.14292
Authors: Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 12 pages, 2 figures, 3 tables. Code and data: this https URL

Click to view the abstract

Abstract:KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500; Hendrycks et al., 2021) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill; DeepSeek-AI, 2025) at budgets b ∈ {64, 128}. All seven were rejected. We then propose α, a one-function modification to the TriAttention (Mao et al., 2026) retention scorer that replaces argmax-top-k with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight λ. A pre-registered protocol tunes λ on a frozen development split and confirms on a disjoint held-out split; with λ = 0.5, α clears Bonferroni on two of the four (model, budget) cells (Qwen b=128 and Llama b=64), no cell is significantly negative, and the pre-registered Branch A triggers. The finding is asymmetric: a minimal scoring modification beat heavier structural redesigns in this regime, and the combined matched-memory, sympy-graded, held-out confirmation protocol is the evidence standard that made the asymmetry visible.
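The core change, replacing argmax-top-k with greedy selection under a redundancy penalty, can be sketched as follows. The toy vectors, cosine similarity, and penalty form are illustrative assumptions; the paper's exact α scorer may differ:

```python
import math

def cosine(u, v):
    # Cosine similarity between two value vectors (0 if either is zero).
    du = math.sqrt(sum(x * x for x in u))
    dv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (du * dv) if du and dv else 0.0

def greedy_select(scores, values, budget, lam=0.5):
    """Greedily keep entries with high retention score minus a lambda-weighted
    penalty for V-space similarity to entries already kept."""
    kept, remaining = [], set(range(len(scores)))
    while remaining and len(kept) < budget:
        def penalized(i):
            redundancy = max((cosine(values[i], values[j]) for j in kept), default=0.0)
            return scores[i] - lam * redundancy
        best = max(remaining, key=penalized)
        kept.append(best)
        remaining.remove(best)
    return kept
```

With λ = 0 this degenerates to plain top-k by score; a positive λ trades raw score for diversity among the retained cache entries.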

[NLP-69] To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model

【Quick Read】: This paper addresses the serious copyright and privacy risks posed by unauthorized scraping of multimodal web data for model fine-tuning amid the rapid advancement of Large Vision-Language Models (LVLMs). Existing countermeasures (e.g., machine unlearning and watermarking) are post-hoc, acting only after infringement has occurred. The core mechanism of the proposed MMGuard is to generate unlearnable examples by injecting human-imperceptible perturbations into multimodal data, proactively exploiting LVLM learning dynamics: by minimizing the training loss, the perturbations form an optimization shortcut that forces the model to overfit the noise during fine-tuning, so downstream performance degrades sharply when the perturbation is absent at inference. To strengthen the defense, MMGuard introduces a cross-modal binding disruption that strategically shifts LVLM attention to enforce a spurious correlation between the noise and the training target, with theoretical guarantees. Combined with an ensemble learning strategy for cross-model transferability, MMGuard achieves effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic, proactive advantage against aggressive fine-tuning exploitation.

Link: https://arxiv.org/abs/2605.14291
Authors: Chengshuai Zhao, Zhen Tan, Dawei Li, Zhiyuan Yu, Huan Liu
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view the abstract

Abstract:The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc approaches that act only after intellectual property infringement has already occurred. In this work, we propose MMGuard to empower data owners to proactively protect their multimodal data against unauthorized LVLM fine-tuning. MMGuard generates unlearnable examples by injecting human-imperceptible perturbations that actively exploit the learning dynamics of LVLMs. By minimizing the training loss, the perturbation creates an optimization shortcut, causing the model to overfit to the noise and thereby degrading downstream performance when the perturbation is absent during inference. To further strengthen this defense, MMGuard introduces a cross-modal binding disruption, strategically shifting LVLM attention to enforce a spurious correlation between the noise and the training target with theoretical guarantees. Enhanced by an ensemble learning strategy for cross-model transferability, MMGuard is evaluated against nine open-source LVLMs across six datasets. Our comprehensive results demonstrate effective, stealthy, and robust protection under white-box, gray-box, and black-box threat models, establishing a mechanistic advantage in proactively defending against aggressive fine-tuning exploitation.

[NLP-70] Web Agents Should Adopt the Plan-Then-Execute Paradigm

【Quick Read】: This paper targets the prompt-injection and control-flow safety risks of the ReAct architecture that current LLM web agents adopt by default. The key to the solution is switching the default paradigm from reason-act to plan-then-execute: the agent commits to a task-specific program before observing runtime web content, then executes it. This design confines untrusted web content to the inside of a predefined execution graph, preventing it from redefining the user task or triggering the model to synthesize new actions at runtime, thereby cutting off prompt-injection attacks at the root. The core obstacle to realizing this scheme is that web tools lack typed interfaces: the semantics of low-level actions such as click and type depend on page state, making planning near-sighted. The remedy is therefore typed, complete, auditable APIs that abstract website interactions into task-level operations, which is an infrastructure problem rather than a modeling problem.

Link: https://arxiv.org/abs/2605.14290
Authors: Julien Piet, Annabella Chow, Yiwei Hou, Muxi Lyu, Sylvie Venuto, Jinhao Zhu, Raluca Ada Popa, David Wagner
Affiliation: UC Berkeley
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments:

Click to view the abstract

Abstract:ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then execute it. The reason is that web content mixes inputs from many parties. An e-commerce product page may combine a seller’s listing, customer reviews and sponsored advertisements. Under ReAct, all of this content flows into the model when deciding on the next action, creating a direct path for prompt injections to steer the agent’s control flow. Plan-then-execute changes this boundary: untrusted data may influence values or branches inside a predefined execution graph, but it cannot redefine the user task or cause the model to synthesize new actions at runtime. We analyze WebArena, a popular web agent benchmark, and find that all tasks are compatible with plan-then-execute, while 80% can be completed with a purely programmatic plan, without any runtime LLM subroutine. We identify the main barrier to adopting plan-then-execute on the web: For it to work well, tools must map cleanly to semantic actions, with effects known before execution, so agents have enough information to plan. The web does not naturally expose that interface. Browser tools such as click, type, and scroll have page-dependent meanings. Planning at this layer is near-sighted: the agent can only see actions on the current page, and later actions appear only after it acts. Closing this gap requires typed interfaces that turn website interactions from clicks and keystrokes to task-level operations. This is an infrastructure problem, not a modeling problem. Web tasks do not need reactivity by default; they need typed, complete, auditable website APIs.
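A minimal sketch of the plan-then-execute boundary the authors advocate: the plan is a fixed list of typed steps committed before any page content is read, so page text can only fill in values, never add or reorder actions. Tool names and the page format are hypothetical:

```python
def run_plan(plan, tools, page):
    """Execute a pre-committed plan. Control flow is fixed: untrusted page
    content can only supply values read by pre-declared tools."""
    results = {}
    for step in plan:                          # steps were fixed before the page was seen
        fn = tools[step["tool"]]               # only tools declared in advance
        results[step["out"]] = fn(page, *step.get("args", ()))
    return results

# Hypothetical typed tools and plan for an e-commerce check.
tools = {
    "read_price": lambda page: page["price"],
    "under_budget": lambda page, limit: page["price"] <= limit,
}
plan = [
    {"tool": "read_price", "out": "price"},
    {"tool": "under_budget", "args": (50,), "out": "ok"},
]
```

Even if the page carries injected instructions, it is treated purely as data: no runtime LLM call can be steered into synthesizing a new action.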

[NLP-71] MetaMoE: Diversity-Aware Proxy Selection for Privacy-Preserving Mixture-of-Experts Unification ICML2026

【Quick Read】: This paper addresses unified training of Mixture-of-Experts (MoE) models when data are distributed across clients and cannot be shared due to privacy constraints, whereas existing MoE methods assume centralized access to training data. The key to the proposed MetaMoE framework is using public proxy data as surrogates for inaccessible private data, centered on diversity-aware proxy selection, which picks client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. The proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router strengthens expert selection over heterogeneous inputs.

Link: https://arxiv.org/abs/2605.14289
Authors: Weisen Jiang, Shuhao Chen, Sinno Jialin Pan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted by ICML 2026

Click to view the abstract

Abstract:Mixture-of-Experts (MoE) models scale capacity by combining specialized experts, but most existing approaches assume centralized access to training data. In practice, data are distributed across clients and cannot be shared due to privacy constraints, making unified MoE training challenging. We propose MetaMoE, a privacy-preserving framework that unifies independently trained, domain-specialized experts into a single MoE using public proxy data as surrogates for inaccessible private data. Central to MetaMoE is diversity-aware proxy selection, which selects client-domain-relevant and diverse samples from public data to effectively approximate private data distributions and supervise router learning. These proxies are further used to align expert training, improving expert coordination at unification time, while a context-aware router enhances expert selection across heterogeneous inputs. Experiments on computer vision and natural language processing benchmarks demonstrate that MetaMoE consistently outperforms recent privacy-preserving MoE unification methods. Code is available at this https URL.

[NLP-72] Auditing Agent Harness Safety

【Quick Read】: This paper addresses the gap that existing LLM-agent safety evaluations score only final outputs or terminal states, missing trajectory-level violations that occur mid-execution (e.g., unauthorized resource access and context leakage), and shows such risks are pervasive in intermediate trajectories yet uncaptured by current benchmarks. The key to the solution is the HarnessAudit framework, which audits full execution trajectories along three core dimensions: boundary compliance, execution fidelity, and system stability. It is accompanied by HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains covering single-agent and multi-agent configurations with embedded safety constraints. The framework's core design detects violation patterns that accumulate over trajectories and quantifies how multi-agent collaboration expands the safety risk surface, revealing that harness design sets the upper bound of safe deployment.

Link: https://arxiv.org/abs/2605.14271
Authors: Chengzhi Liu, Yichen Guo, Yepeng Liu, Yuzhe Yang, Qianqi Yan, Xuandong Zhao, Wenyue Hua, Sheng Liu, Sharon Li, Yuheng Bu, Xin Eric Wang
Affiliation: Unknown
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 11 pages, 8 figures

Click to view the abstract

Abstract:LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output-level evaluation cannot see these failures, yet most safety benchmarks score only final outputs or terminal states, even though many violations occur mid-trajectory rather than at termination. The central question is whether the harness respects user intent, permission boundaries, and information-flow constraints throughout execution. To address this gap, we propose HarnessAudit, a framework that audits full execution trajectories across boundary compliance, execution fidelity, and system stability, with a focus on multi-agent harnesses where these risks are most pronounced. We further introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, instantiated in both single-agent and multi-agent configurations with embedded safety constraints. Evaluating ten harness configurations across frontier models and three multi-agent frameworks, we find that: (i) task completion is misaligned with safe execution, and violations accumulate with trajectory length; (ii) safety risks vary across domains, task types, and agent roles; (iii) most violations concentrate in resource access and inter-agent information transfer; and (iv) multi-agent collaboration expands the safety risk surface, while harness design sets the upper bound of safe deployment.

[NLP-73] Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

【Quick Read】: This paper addresses the hallucinations and failures in multi-hop and n-ary reasoning that arise when applying large language models (LLMs) to heterogeneous enterprise systems, as well as the lack of semantic grounding and auditable execution in existing paradigms (e.g., GraphRAG, NL2SQL) for such complex environments. The key to the proposed HEAR (an enterprise agentic reasoner) is a Stratified Hypergraph Ontology: the base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Running an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without retraining the LLM, reaching 94.7% accuracy on supply-chain tasks such as root cause analysis of order-fulfillment blockages. It uses procedural hyperedges to cut token costs and topological exploration to guarantee correctness, establishing a scalable, auditable foundation for enterprise intelligence.

Link: https://arxiv.org/abs/2605.14259
Authors: Ling Wang, Songnan Liu, Jianan Wang, Cheng Cheng, Xin Liu, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

[NLP-74] What Makes Words Hard? Sakura at BEA 2026 Shared Task on Vocabulary Difficulty Prediction

【Quick Read】: This paper addresses vocabulary difficulty prediction, with particular attention to improving model explainability while maintaining high accuracy. The key to the solution is twofold: first, fine-tuning an LLM with a soft-target loss function to build a high-accuracy black-box model, which achieved the top result in the open track (correlation r = 0.91); second, designing an explainable model that, at a modest cost to performance (r = 0.77), reveals the factors behind genuine production difficulty. The authors further analyze how the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is affected by spelling difficulty or the construction of the test items.

Link: https://arxiv.org/abs/2605.14257
Authors: Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka
Affiliation: RIKEN; The University of Osaka; Nara Institute of Science and Technology; National Tsing Hua University; The University of Tokyo; Tohoku University
Subjects: Computation and Language (cs.CL)
Comments: To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Click to view the abstract

Abstract:We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r = 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r = 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council’s Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at this https URL.
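The soft-target loss used for the black-box model can be illustrated with a minimal cross-entropy against a distribution over rating bins rather than a one-hot label; the bin count and values here are illustrative assumptions:

```python
import math

def soft_target_loss(logits, target_dist):
    """Cross-entropy between the model's softmax and a soft label distribution
    (e.g. a spread of annotator difficulty ratings)."""
    m = max(logits)                                  # stabilized softmax
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    log_probs = [math.log(e / s) for e in exps]
    return -sum(t * lp for t, lp in zip(target_dist, log_probs))
```

Unlike a hard label, a soft target rewards the model for matching the full shape of the rating distribution, which suits graded judgments such as word difficulty.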

[NLP-75] Active Learners as Efficient PRP Rerankers

【Quick Read】: This paper addresses the inherent noise, order sensitivity, and intransitivity of Pairwise Ranking Prompting (PRP)-based LLM reranking in call-constrained settings, as well as the degradation caused by classical sorting algorithms, which aim to recover a full permutation and therefore cannot be reliably truncated to a top-K. The key to the solution is reframing PRP reranking as active learning from noisy pairwise comparisons and proposing a noise-robust framework whose core innovation is a randomized-direction oracle using a single LLM call per pair: it converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without bidirectional calls, while active rankers serve as drop-in components that significantly improve NDCG@10 under a call budget.

Link: https://arxiv.org/abs/2605.14236
Authors: Jeremías Figueiredo Paschmann, Juan Kaplan, Francisco Nattero, Santiago Mauricio Barron Bucolo, Juan Wisznia, Luciano del Corro
Affiliation: ELIAS Lab, Departamento de Ingeniería, Universidad de San Andrés
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 7 figures. Preprint

Click to view the abstract

Abstract:Pairwise Ranking Prompting (PRP) elicits pairwise preference judgments from an LLM, which are then aggregated into a ranking, usually via classical sorting algorithms. However, judgments are noisy, order-sensitive, and sometimes intransitive, so sorting assumptions do not match the setting. Because sorting aims to recover a full permutation, truncating it to meet a call budget does not produce a dependable top-K. We thus reframe PRP reranking as active learning from noisy pairwise comparisons and show that active rankers are drop-in replacements that improve NDCG@10 per call in the call-constrained regime. Our noise-robust framework also introduces a randomized-direction oracle that uses a single LLM call per pair. This approach converts systematic position bias into zero-mean noise, enabling unbiased aggregate ranking without the cost of bidirectional calls.
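The randomized-direction oracle can be sketched with a toy positionally biased judge: flipping the presentation order uniformly at random turns a systematic first-position bias into zero-mean noise, while still spending one call per pair. The bias model below is a hypothetical stand-in for an LLM:

```python
import random

def biased_judge(first, second, true_score, bias=0.3):
    # Stand-in for an LLM judgment that systematically favors the
    # first-listed item by a fixed additive bias.
    margin = true_score[first] - true_score[second] + bias
    return first if margin > 0 else second

def randomized_oracle(a, b, true_score, rng):
    """One call per pair, with presentation order flipped at random."""
    if rng.random() < 0.5:
        return biased_judge(a, b, true_score)
    return biased_judge(b, a, true_score)
```

When the true quality gap exceeds the bias, the better item wins regardless of order; when it does not, the randomized flip makes the bias average out to zero across comparisons instead of consistently favoring one position.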

[NLP-76] DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

【Quick Read】: This paper addresses the limited accuracy of existing disease-trajectory prediction models, whose training data from single hospitals or research cohorts fail to reflect the complexity of real clinical environments. The key to the solution is the DT-Transformer foundation model, trained and validated at scale on 57.1M structured electronic health record (EHR) entries from Mass General Brigham (MGB), covering 1.7M patients, 11 hospitals, and a broad outpatient network. The model achieves strong discrimination in both held-out and prospective validation, e.g., a median AUC of 0.871 for next-event prediction across 896 disease categories, supporting health-system-scale training as a path toward foundation models for real-world clinical forecasting.

Link: https://arxiv.org/abs/2605.14227
Authors: Yunying Zhu, Andrew R Weckstein, Kueiyu Joshua Lin, Jie Yang
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Work in Progress

Click to view the abstract

Abstract:Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient’s trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.

[NLP-77] Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

【Quick Read】: This paper addresses Training-Inference Mismatch (TIM) in modern LLM reinforcement learning systems, where implementation differences between rollout generation and policy optimization cause the two stages to assign different token probabilities to the same sequence under the same model weights, independently triggering training collapse. The key to the solution is isolating TIM in a zero-mismatch diagnostic setting (VeXact), quantifying how small numerical disagreements affect training stability, and identifying a set of remedies that mitigate TIM, thereby treating TIM as a first-order systems-level perturbation in analyzing LLM RL stability.

Link: https://arxiv.org/abs/2605.14220
Authors: Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu
Affiliation: ByteDance; The University of Virginia
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.

[NLP-78] PreFT: Prefill-only finetuning for efficient inference

【Quick Read】: This paper addresses the low overall serving throughput caused by the mismatch between the prefill and decode stages when serving many user-specific parameter-efficient fine-tuning (PEFT) adapters. The key to the solution is Prefill-only Finetuning (PreFT): the adapter is applied only to prefill tokens and discarded during decode, avoiding the multi-adapter performance bottleneck in the decode stage. This significantly increases throughput, and the performance gap can be compensated by, e.g., increasing the adapter rank, yielding a better throughput-accuracy tradeoff.

Link: https://arxiv.org/abs/2605.14217
Authors: Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts
Affiliation: Stanford University; Tilde Research
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Systems and Control (eess.SY)
Comments:

Click to view the abstract

Abstract:Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs (1.9× the throughput when serving 512 adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
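The prefill-only idea reduces to a phase flag in the forward pass: the adapter delta is added while encoding the prompt and skipped during decode, so decode throughput matches the base model. The toy layer and scale below are illustrative, not the released implementation:

```python
def base_layer(x):
    # Stand-in for a frozen base-model layer.
    return [2.0 * v for v in x]

def lora_delta(x, scale=0.5):
    # Stand-in for the user-specific adapter contribution.
    return [scale * v for v in x]

def forward(x, phase):
    out = base_layer(x)
    if phase == "prefill":          # adapter active only while encoding the prompt
        out = [o + d for o, d in zip(out, lora_delta(x))]
    return out                      # decode path is identical to the base model
```

The adapted prefill still shapes the KV cache that decode conditions on, which is how personalisation survives dropping the adapter at decode time.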

[NLP-79] GradShield: Alignment Preserving Finetuning

【Quick Read】: This paper addresses the problem that fine-tuning can break the safety alignment of large language models exposed to explicitly or implicitly harmful data; even seemingly benign data can steer a model toward misaligned behavior. The key to the proposed filtering method, GradShield, is computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and applying an adaptive thresholding algorithm to identify and remove alignment-corrupting data before fine-tuning, keeping the Attack Success Rate (ASR) consistently below 6% while preserving utility performance.

Link: https://arxiv.org/abs/2605.14194
Authors: Zhanhao Hu, Xiao Huang, Patrick Mendoza, Emad A. Alghamdi, Basel Alomair, Raluca Ada Popa, David Wagner
Affiliation: University of California, Berkeley; HUMAIN; King Abdulaziz City for Science and Technology; University of Washington, Seattle
Subjects: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model’s alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below 6% while preserving utility performance.
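The score-then-filter shape of this pipeline can be sketched as follows; the stand-in scores and the mean-plus-one-standard-deviation threshold rule are illustrative assumptions, not the paper's FIHS or its adaptive thresholding algorithm:

```python
def adaptive_filter(scores):
    """Keep indices of examples whose harmfulness score falls at or below an
    adaptive threshold (here: mean + one standard deviation, an assumed rule)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    thresh = mean + var ** 0.5
    return [i for i, s in enumerate(scores) if s <= thresh]
```

A data-dependent threshold adapts to the overall contamination level, flagging outlying high-score examples while keeping clean datasets intact.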

[NLP-80] Why Retrieval-Augmented Generation Fails: A Graph Perspective

【Quick Read】: This paper investigates why Retrieval-Augmented Generation (RAG) still produces incorrect answers despite access to external evidence, using a model-internal study to reveal the structural signatures of failure. The key to the solution is using circuit tracing to construct attribution graphs that model information flow across transformer layers, which exposes systematic differences between correct and incorrect predictions in reasoning depth, evidence-flow distribution, and local connectivity patterns. Building on these graph-topology features, the authors develop an error-detection framework and reshape internal routing by reinforcing question-constrained evidence grounding, improving the integration of retrieved information and reducing errors.

Link: https://arxiv.org/abs/2605.14192
Authors: Kai Guo, Xinnan Dai, Zhibo Zhang, Nuohan Lin, Shenglai Zeng, Jie Ren, Haoyu Han, Jiliang Tang
Affiliation: Michigan State University; Massachusetts Institute of Technology
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Click to view the abstract

Abstract:Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model’s reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

[NLP-81] BOOKMARKS: Efficient Active Storyline Memory for Role-playing

【Quick Read】: This paper addresses the loss of important details caused by compression in existing recurrent-summarization-based memory methods, which role-playing agents (RPAs) rely on to maintain long-horizon consistency. The key to the solution is BOOKMARKS, a search-based memory framework that actively initializes, maintains, and updates bookmarks relevant to the current task (e.g., character acting), where each bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones with useful questions at the storyline's beginning, then synchronizes them to the current story point and updates their answers for efficient reuse in future rounds, combining active capture of task-specific details with passive avoidance of unnecessary computation.

Link: https://arxiv.org/abs/2605.14169
Authors: Letian Peng, Ziche Liu, Yiming Huang, Longfei Yun, Kun Zhou, Yupeng Hou, Jingbo Shang
Affiliation: University of California, San Diego
Subjects: Computation and Language (cs.CL)
Comments:

Click to view the abstract

Abstract:Memory systems are critical for role-playing agents (RPAs) to maintain long-horizon consistency. However, existing RPA memory methods (e.g., profiling) mainly rely on recurrent summarization, whose compression inevitably discards important details. To address this issue, we propose a search-based memory framework called BOOKMARKS, which actively initializes, maintains, and updates task-relevant pieces of bookmarks for the current task (e.g., character acting). A bookmark is structured as the answer to a question at a specific point in the storyline. For each current task, BOOKMARKS selects reusable existing bookmarks or initializes new ones (at storyline beginning) with useful questions. These bookmarks are then synchronized to the current story point, with their answers updated accordingly, so they can be efficiently reused in future grounding rounds. Compared with recurrent summarization, BOOKMARKS offers (1) active grounding for capturing task-specific details and (2) passive updating to avoid unnecessary computation. In implementation, BOOKMARKS supports concept, behavior, and state searches, each powered by an efficient synchronization method. BOOKMARKS significantly outperforms RPA memory baselines on 85 characters from 16 artifacts, demonstrating the effectiveness of search-based memory for RPAs.
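The bookmark abstraction can be sketched as a question paired with an answer and the storyline point at which it was last synchronized; syncing replays only the events since that point. The event format below is an assumption for illustration:

```python
class Bookmark:
    """A question, its current answer, and the storyline point (an integer
    timestamp here) up to which the answer is valid."""
    def __init__(self, question, answer=None, point=0):
        self.question, self.answer, self.point = question, answer, point

    def sync(self, events, now):
        """Update the answer using only events in (self.point, now];
        events are assumed to be (time, question, answer) triples."""
        for t, question, answer in events:
            if self.point < t <= now and question == self.question:
                self.answer = answer
        self.point = now
        return self.answer
```

Because each sync is incremental, reusing a bookmark in a later round costs only the events since its last synchronization, rather than a fresh pass over the whole storyline.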

[NLP-82] ROK-FORTRESS: Measuring the Effect of Geopolitical Transcreation for National Security and Public Safety

【Quick Read】: This paper addresses a neglected issue in multilingual safety evaluation of large language models: how the interaction of language and geopolitical context shapes model behavior under National Security and Public Safety (NSPS) risks, which existing translation-only benchmarks cannot capture and for which empirical evidence is limited to a few language pairs. The key to the solution is a methodology called the transcreation matrix, which systematically crosses two factors, (i) language (English vs. Korean) and (ii) geopolitical grounding (U.S. vs. Korean entities, institutions, and operational details), into controlled conditions that isolate and quantify the separate and interacting effects of language and geopolitical context on model responses to adversarial prompts. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal, and responses are scored by calibrated LLM-as-a-judge panels against expert-crafted binary rubrics. Using the English-Korean pair and the U.S.-ROK geopolitical axis as a case study, the authors find that translation-only evaluation misses a suppression effect arising from language-as-risk signals interacting with context, and the transcreation matrix is designed to generalize to other language-culture pairs.

链接: https://arxiv.org/abs/2605.14152
作者: Michael S. Lee,Yash Maurya,Drew Rein,Bert Herring,Jonathan Nguyen,Kyungho Song,Udari Madhushani Sehwag,Jiyeon Cho,Kaustubh Deshpande,Yeongkyun Jang,Jiyeon Joo,Minn Seok Choi,Evi Fuelle,Christina Q Knight,Joseph Brandifino,Max Fenkell
机构: Scale AI
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
备注: 16 pages main body + appendix (63 total), 5 main figures, 4 main tables; dataset at this https URL

点击查看摘要

Abstract:Safety evaluations for large language models (LLMs) increasingly target high-stakes National Security and Public Safety (NSPS) risks, yet multilingual safety is typically assessed through translation-only benchmarks that preserve the underlying scenario, and empirical evidence of how language and geopolitical context interact remains limited to a narrow set of language pairs. We introduce ROK-FORTRESS (this https URL), a bilingual, culturally adversarial NSPS benchmark that uses the English–Korean language pair and U.S.–ROK geopolitical axis as a case study, separating the effects of language and geopolitical grounding via a transcreation matrix: adversarial intents are evaluated under controlled combinations of (i) English versus Korean language and (ii) U.S. versus Korean entities, institutions, and operational details. Each adversarial prompt is paired with a dual-use benign counterpart to quantify over-refusal. Model responses are then scored using calibrated LLM-as-a-judge panels, applying our expert-crafted, prompt-specific binary rubrics. Across a dual-track set of frontier and Korean-optimized models, we find a consistent suppression effect in Korean variants and substantial model-to-model variation in how geopolitical grounding interacts with language. In many models, Korean grounding mitigates the Korean language-driven suppression, with no model showing significant amplification in the other direction, indicating that, at least in the English–Korean case, safety behavior is shaped by language-as-risk signals and context interactions that translation-only evaluations miss. The transcreation matrix methodology is designed to generalize to other language–culture pairs.
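论文的"创译矩阵"本质上是语言 × 地缘政治落地两个因素的 2×2 受控组合。下面用几行 Python 勾勒条件构造方式(字段名与取值均为假设性示意,并非论文数据集的真实 schema):

```python
from itertools import product

# 两个受控因素:语言(英语/韩语)与地缘政治落地(美国/韩国实体与机构)
LANGS = ["en", "ko"]
GROUNDINGS = ["us", "rok"]

def transcreation_conditions(intent_id):
    """对一条对抗意图,生成 2x2 创译矩阵的四个受控变体。"""
    return [
        {"intent": intent_id, "language": lang, "grounding": g}
        for lang, g in product(LANGS, GROUNDINGS)
    ]

conds = transcreation_conditions("intent-001")
print(len(conds))  # 2 x 2 = 4 个受控条件
```

这样即可把"翻译效应"(固定 grounding、改变 language)与"落地效应"(固定 language、改变 grounding)分离开来分别估计。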

[NLP-83] Polar probe linearly decodes semantic structures from LLMs

【速读】: 这篇论文试图解决人工神经网络如何将概念绑定为复杂语义结构(complex semantic structures)的问题。解决问题的关键方案是提出一种简单的神经编码(neural code),其中实体间关系的存在和类型分别由其嵌入向量(embeddings)之间的距离和方向来表示。通过在大语言模型(LLMs)的层激活子空间中设计一个线性极坐标探针(Polar Probe),可以恢复真实的语义结构。

链接: https://arxiv.org/abs/2605.14125
作者: Pablo J. Diego-Simón,Pierre Orhan,Yair Lakretz,Jean-Rémi King
机构: LSCP(语言科学与心理语言学实验室), ENS(巴黎高等师范学院), PSL(巴黎文理研究大学), EHESS(社会科学高等研究院), CNRS(法国国家科学研究中心), Paris, France; Paris Brain Institute(巴黎大脑研究所), Paris, France; Meta AI, Paris, France
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:How do artificial neural networks bind concepts to form complex semantic structures? Here, we propose a simple neural code, whereby the existence and the type of relations between entities are represented by the distance and the direction between their embeddings, respectively. We test this hypothesis in a variety of Large Language Models (LLMs), each prompted with natural-language descriptions of minimalist tasks from five different domains: arithmetic, visual scenes, family trees, metro maps and social interactions. Results show that the true semantic structures can be linearly recovered with a Polar Probe targeting a subspace of LLMs' layer activations. Second, this code emerges mostly in middle layers and improves with LLM performance. Third, these Polar Probes successfully generalize to new entities and relation types, but degrade with the size of the semantic structure. Finally, the quality of the polar representation correlates with the LLM's ability to answer questions about the semantic structure. Together, these findings suggest that LLMs learn to build complex semantic structures by binding representations with a simple geometrical principle.
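按摘要描述的编码假设(距离编码关系是否存在、方向编码关系类型),可以在合成嵌入上勾勒一个极简的"极坐标探针"解码流程。以下示例中的维度、阈值、关系类型以及数据生成方式均为假设,仅用于演示该几何原理,并非论文的探针实现:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # 玩具嵌入维度

# 假设每种关系类型对应嵌入差向量的一个固定方向
type_dirs = {"parent_of": rng.normal(size=d), "married_to": rng.normal(size=d)}
for k in type_dirs:
    type_dirs[k] /= np.linalg.norm(type_dirs[k])

def make_pair(rel):
    """合成一对有关系的实体嵌入:b 在 a 的基础上沿关系方向平移。"""
    a = rng.normal(size=d)
    b = a + 2.0 * type_dirs[rel] + 0.05 * rng.normal(size=d)
    return a, b

# "训练"探针:用若干样本的归一化差向量估计每种关系的原型方向
prototypes = {}
for rel in type_dirs:
    diffs = []
    for _ in range(50):
        a, b = make_pair(rel)
        v = b - a
        diffs.append(v / np.linalg.norm(v))
    p = np.sum(diffs, axis=0)
    prototypes[rel] = p / np.linalg.norm(p)

def polar_decode(a, b, dist_thresh=4.0):
    v = b - a
    dist = np.linalg.norm(v)   # 距离编码"是否存在关系"
    if dist > dist_thresh:
        return None            # 距离过远,判定为无关系
    v = v / dist               # 方向编码"关系类型"
    return max(prototypes, key=lambda r: float(v @ prototypes[r]))

a, b = make_pair("parent_of")
print(polar_decode(a, b))
```

真实探针作用于 LLM 中间层激活的一个线性子空间,而非这里的合成向量,但"距离判存在、方向判类型"的解码逻辑与此一致。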

[NLP-84] Mini-JEPA Foundation Model Fleet Enables Agentic Hydrologic Intelligence

【速读】: 该论文试图解决通用地理空间基础模型(geospatial foundation model, 如 Google AlphaEarth)在专业化水文信号(hydrologic signals)表征上的不足,以及此类通用模型因数据、成本和计算需求高而难以获取和部署的问题。解决方案的关键是提出 Mini-JEPAs——一组小型、传感器专用的联合嵌入预测架构(Joint Embedding Predictive Architecture, JEPA)基础模型,每个模型针对特定遥感传感器(如 Sentinel-2 光学、Sentinel-1 SAR、MODIS 热红外、多时相 Sentinel-2 物候以及地形-土壤堆栈)进行预训练,共享相同的 Vision Transformer 骨干网络、JEPA 训练范式与 64 维输出空间;同时引入一个路由大型语言模型(router LLM)作为智能代理,根据专门问题的模态需求自动选择最合适的传感器专用模型,通过这种“路由选择+多模型组合”的架构在保持计算高效的同时提升对水文变量的预测精度(如土壤湿度、干旱指数和降水量的 ΔR^2 最高提升 0.031),并可在本地以中等计算资源运行。

链接: https://arxiv.org/abs/2605.14120
作者: Mashrekur Rahman
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Geospatial foundation models compress multispectral observations into dense embeddings increasingly used in natural-language environmental reasoning systems. A single planetary-scale model, e.g. Google AlphaEarth, handles broad characterization well but may compromise on specialized hydrologic signals. Such generalist models are also often inaccessible, expensive, and require large-scale compute. We propose Mini-JEPAs: a fleet of small sensor-specialized Joint Embedding Predictive Architecture (JEPA) foundation models consulted by a routing agent for specialized questions. We pretrained five 22M-parameter Mini-JEPAs sharing an identical Vision Transformer backbone, JEPA recipe, and 64-d output space, using Sentinel-2 optical, Sentinel-1 SAR, MODIS thermal, multi-temporal Sentinel-2 phenology, and a topography-soil stack. Each Mini-JEPA reconstructs the variable matched to its sensor, with cross-validated R^2 reaching 0.97 for elevation, 0.97 for temperature, and 0.81 for precipitation. The five manifolds differ in geometric structure, with global participation ratios from 8.9 to 20.2 and local intrinsic dimensionalities from 2.3 to 9.0. Joint topography-soil and phenology models add predictive value beyond AlphaEarth alone for soil moisture, aridity, and precipitation (ΔR^2 up to 0.031). A router LLM reads per-modality references and selects appropriate sensors with a perfect hit rate over a curated question set. In paired LLM-as-Judge evaluation, dual retrieval over AlphaEarth and the routed fleet outperforms AlphaEarth alone on physics-matched questions (Cohen's d = 1.10, p = 0.031). Locally-trained Mini-JEPAs can be operationalized for hydrologic intelligence with modest compute.

[NLP-85] Generative Floor Plan Design with LLMs via Reinforcement Learning with Verifiable Rewards ACL2026

【速读】: 该论文试图解决专业平面图设计中的核心问题:如何在满足房间之间期望连通性的同时,精确控制房间尺寸和面积等数值约束,并保持功能与美学质量。现有生成方法主要关注拓扑连接性,无法处理数值类约束。解决方案的关键在于提出一种基于文本的平面图生成方法:首先对真实平面图微调大型语言模型(LLM),然后应用带可验证奖励的强化学习(RLVR),通过奖励信号引导模型更严格地遵循拓扑和数值约束,同时抑制无效或重叠输出;此外,还设计了一套约束一致性度量来系统评估生成结果。该方案使LLM能够有效处理混合约束,在现实主义、兼容性和多样性指标上均优于现有方法,尤其在兼容性上实现了至少94%的相对降低。

链接: https://arxiv.org/abs/2605.14117
作者: Luis Lara,Aristides Milios,Zhi Hao Luo,Aditya Sharma,Ge Ya Luo,Christopher Beckham,Florian Golemo,Christopher Pal
机构: Mila – Quebec AI Institute (Mila – 魁北克人工智能研究所); Université de Montréal (蒙特利尔大学); Polytechnique Montréal (蒙特利尔理工学院); Canada CIFAR AI Chair (加拿大CIFAR人工智能主席)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to Findings of ACL 2026

点击查看摘要

Abstract:An AI system for professional floor plan design must precisely control room dimensions and areas while respecting the desired connectivity between rooms and maintaining functional and aesthetic quality. Existing generative approaches focus primarily on respecting the requested connectivity between rooms, but do not support generating floor plans that respect numerical constraints. We introduce a text-based floor plan generation approach that fine-tunes a large language model (LLM) on real plans and then applies reinforcement learning with verifiable rewards (RLVR) to improve adherence to topological and numerical constraints while discouraging invalid or overlapping outputs. Furthermore, we design a set of constraint adherence metrics to systematically measure how generated floor plans align with user-defined constraints. Our model generates floor plans that satisfy user-defined connectivity and numerical constraints and outperforms existing methods on Realism, Compatibility, and Diversity metrics. Across all tasks, our approach achieves at least a 94% relative reduction in Compatibility compared with existing methods. Our results demonstrate that LLMs can effectively handle constraints in this setting, suggesting broader applications for text-based generative modeling.
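RLVR 的前提是奖励可以程序化验证。下面给出一个假设性的极简奖励函数:检查面积等数值约束(容差内计分)、要求的房间连通性(以相邻表示)以及无重叠有效性;房间用轴对齐矩形 (x, y, w, h) 表示。具体奖励设计仅为演示思路,并非论文原始实现:

```python
def overlap(r1, r2):
    """两个轴对齐矩形是否有面积重叠(仅共享边界不算)。"""
    x1, y1, w1, h1 = r1; x2, y2, w2, h2 = r2
    return not (x1 + w1 <= x2 or x2 + w2 <= x1 or y1 + h1 <= y2 or y2 + h2 <= y1)

def adjacent(r1, r2, eps=1e-6):
    """两个矩形是否共享一段正长度的边(即连通/相邻)。"""
    x1, y1, w1, h1 = r1; x2, y2, w2, h2 = r2
    share_x = min(x1 + w1, x2 + w2) - max(x1, x2) > eps
    share_y = min(y1 + h1, y2 + h2) - max(y1, y2) > eps
    touch_v = abs(x1 + w1 - x2) < eps or abs(x2 + w2 - x1) < eps
    touch_h = abs(y1 + h1 - y2) < eps or abs(y2 + h2 - y1) < eps
    return (touch_v and share_y) or (touch_h and share_x)

def reward(plan, area_spec, edges, tol=0.1):
    r = 0.0
    for name, target in area_spec.items():   # 数值约束:面积相对误差在容差内
        x, y, w, h = plan[name]
        r += 1.0 if abs(w * h - target) / target <= tol else 0.0
    for a, b in edges:                       # 拓扑约束:要求的房间必须相邻
        r += 1.0 if adjacent(plan[a], plan[b]) else 0.0
    names = list(plan)
    for i in range(len(names)):              # 无效性惩罚:房间两两不得重叠
        for j in range(i + 1, len(names)):
            if overlap(plan[names[i]], plan[names[j]]):
                r -= 1.0
    return r

plan = {"living": (0, 0, 4, 5), "kitchen": (4, 0, 3, 5)}
print(reward(plan, {"living": 20, "kitchen": 15}, [("living", "kitchen")]))
```

这类完全由规则判定的奖励正是 RLVR 所需的"可验证"信号:LLM 以文本形式生成 plan,解析后即可打分。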

[NLP-86] When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

【速读】: 该论文试图解决生物医学检索增强大语言模型在处理冲突、误导或不完整证据时的可靠性不足问题,现有评估方法通常只关注有帮助上下文下的答案准确性,而忽略了证据内部存在矛盾时模型应如何应对。解决方案的关键是提出一种冲突感知的弃权分数(conflict-aware abstention score),它通过结合模型自身置信度与证据冲突检测器,使模型在遇到高冲突证据时能够主动选择弃权而非强行回答,从而在困难的证据条件下显著提升选择性准确性。

链接: https://arxiv.org/abs/2605.14115
作者: Yikun Han,Mengfei Lan,Halil Kilicoglu
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL)
备注: Accepted by BioNLP 2026

点击查看摘要

Abstract:Biomedical retrieval-augmented large language models (LLMs) often face evidence that is incomplete, misleading, or internally contradictory, yet evaluation usually emphasizes answer accuracy under helpful context rather than reliability under conflict. Using HealthContradict, we evaluate six open-weight LLMs under five controlled evidence conditions: no retrieved context, correct-only context, incorrect-only context, and two mixed conditions containing both correct and contradictory documents in opposite orders. In this conflicting-evidence order contrast, where the same two documents are both present and only their order is reversed, accuracy drops for every model and 11.4%–25.2% of predictions flip. To support abstention in these difficult cases, we also evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, this score improves selective accuracy over confidence-only, with mean gains of 7.2–33.4 points in incorrect-only (IC') and 3.6–14.4 points in incorrect-first conflicting (ICC') conditions across 75%, 50%, and 25% coverage. These results show that conflicting biomedical evidence is both an uncertainty and robustness problem and motivate evaluation and abstention methods that explicitly account for evidence disagreement.
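冲突感知弃权分数的思想可以用几行代码示意:把模型置信度与冲突检测分数组合成一个排序分,在目标覆盖率下优先对"高冲突、低置信"的样本弃权,再在保留样本上计算选择性准确率。组合方式(线性加权)与数据均为假设性示例,并非论文的具体打分公式:

```python
def selective_accuracy(confidence, conflict, correct, coverage, lam=1.0):
    """按 score = confidence - lam * conflict 排序,保留前 coverage 比例样本后计算准确率。"""
    scores = [c - lam * k for c, k in zip(confidence, conflict)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[: max(1, int(round(coverage * len(scores))))]
    return sum(correct[i] for i in kept) / len(kept)

confidence = [0.9, 0.8, 0.85, 0.6]
conflict   = [0.0, 0.9, 0.1, 0.8]   # 假设的检测器输出:样本 2、4 的证据高度冲突
correct    = [1,   0,   1,   0]

print(selective_accuracy(confidence, conflict, correct, coverage=0.5))
```

与仅用置信度排序相比,冲突项会把"置信但证据自相矛盾"的样本推向弃权端,这正是摘要中选择性准确率提升的来源。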

[NLP-87] Measuring and Mitigating Toxicity in Large Language Models: A Comprehensive Replication Study

【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在训练过程中从网络规模语料中吸收有害模式所导致的“有毒退化”(toxic degeneration)问题——即即便使用无害提示也可能触发有害输出,对实际部署构成风险。解决方案之关键在于DExperts(Decoding-time Experts),这是一种推理时(inference-time)缓解技术,通过在没有额外模型重训练(retraining)的情况下,在解码阶段引导生成过程来降低毒性。其核心机制是引入一个“专家”模型与原始模型联合解码,以抑制有害内容的出现。

链接: https://arxiv.org/abs/2605.14087
作者: Mokshit Surana,Archit Rathod,Akshaj Satishkumar
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs), when trained on web-scale corpora, inherently absorb toxic patterns from their training data. This leads to "toxic degeneration" where even innocuous prompts can trigger harmful outputs. This phenomenon poses significant risks for real-world deployments, necessitating effective mitigation strategies that maintain model utility while ensuring safety. In this comprehensive replication study, we evaluate the efficacy of DExperts (Decoding-time Experts), an inference-time mitigation technique that steers generation without requiring model retraining. We structured our research into three systematic phases: (1) establishing baseline toxicity measurements using RealToxicityPrompts on standard GPT-2 models; then (2) implementing and evaluating DExperts to mitigate explicit toxicity; and finally (3) stress-testing the method against implicit hate speech using the adversarial ToxiGen dataset. Our empirical results confirm that while DExperts achieves near-perfect safety rates (100%) on explicit toxicity benchmarks, it exhibits brittleness against adversarial, implicit hate speech, with safety rates dropping to 98.5%. Furthermore, we quantify a critical trade-off. The method introduces a ~10x latency penalty (from 0.2s to 2.0s per generation), posing challenges for real-time deployment scenarios. This study contributes to the growing body of work on AI safety by highlighting the robustness gap between explicit and implicit toxicity mitigation. We emphasize the need for more sophisticated approaches that generalize across diverse hate speech patterns without prohibitive computational costs.
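DExperts 在解码时的核心组合是 z = z_base + α · (z_expert − z_anti):expert 在无毒文本上微调、anti-expert 在有毒文本上微调,二者 logits 之差用来引导基础模型。下面用玩具词表与 logits 演示该公式(词表、数值均为假设,仅展示组合机制):

```python
import numpy as np

def dexperts_probs(z_base, z_expert, z_anti, alpha=2.0):
    """DExperts 组合:z = z_base + alpha * (z_expert - z_anti),再做 softmax。"""
    z = z_base + alpha * (z_expert - z_anti)
    z = z - z.max()            # 数值稳定的 softmax
    p = np.exp(z)
    return p / p.sum()

vocab = ["hello", "friend", "slur"]      # 玩具词表,"slur" 代表有毒词
z_base   = np.array([1.0, 1.0, 1.0])     # 基础模型对三个词无偏好
z_expert = np.array([1.2, 1.2, 0.2])     # 专家:压低有毒词
z_anti   = np.array([0.5, 0.5, 2.0])     # 反专家:偏好有毒词

p = dexperts_probs(z_base, z_expert, z_anti)
print(vocab[int(np.argmax(p))], round(float(p[2]), 4))  # 有毒词概率被显著压低
```

延迟代价也由此可见:每个解码步需要三次前向(base、expert、anti-expert),与摘要中约 10 倍的延迟开销相印证。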

[NLP-88] CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

【速读】: 该论文试图解决代码代理(code agents)在完成长周期仓库状态推理与严格工具使用协议遵守时,其指令模型(Instruct)与思考模型(Thinking)能力互补但行为错位的问题:Instruct模型简洁且工具纪律性强,但缺乏深度规划与恢复能力;Thinking模型规划与恢复能力强,却因过度思考而损害效率。解决方案的关键在于提出一种无需训练的参数编辑方法CRANE(Constrained Reasoning Injection for Code Agents via Nullspace Editing),将思考模型与指令模型的参数差异视为候选推理编辑的有向池,通过幅度阈值化(magnitude thresholding)去噪、保守泰勒门(Conservative Taylor Gate)筛选出对推理迁移和工具使用保留均有益的方向,以及渐进S形投影(Graduated Sigmoidal Projection)抑制格式关键更新,从而将思考模型的推理优势注入指令主干网络,同时保持指令模型的工具纪律与效率。

链接: https://arxiv.org/abs/2605.14084
作者: Mingzhi Zhu,Michele Merler,Raju Pavuluri,Stacy Patterson
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院); IBM Research (IBM研究院)
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass@1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass@1/pass@5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.
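CRANE 的第一步"幅度阈值化"可以单独示意:取 Thinking 与 Instruct 的参数差 delta,只保留幅度最大的 top-k 比例方向再注入 Instruct 权重;保守泰勒门与渐进投影此处省略。以下仅为该步骤在小矩阵上的假设性草图,并非论文完整算法:

```python
import numpy as np

def magnitude_threshold_merge(w_instruct, w_thinking, keep_ratio=0.25, scale=1.0):
    """保留 |delta| 最大的 keep_ratio 比例条目,其余方向视为噪声置零。"""
    delta = w_thinking - w_instruct
    k = max(1, int(keep_ratio * delta.size))
    cutoff = np.sort(np.abs(delta).ravel())[-k]   # 第 k 大的幅度作为阈值
    mask = np.abs(delta) >= cutoff
    return w_instruct + scale * delta * mask

rng = np.random.default_rng(0)
wi = rng.normal(size=(4, 4))
# 构造一个只有第一列差异大的 Thinking 检查点(按列缩放差值,纯属演示)
wt = wi + rng.normal(size=(4, 4)) * np.array([3.0, 0.1, 0.1, 0.1])

merged = magnitude_threshold_merge(wi, wt, keep_ratio=0.25)
print(np.count_nonzero(merged - wi))  # 只有少数高幅度方向被注入
```

真实方法在此之后还要用泰勒门逐方向估计对推理与工具使用两个目标的联合收益,并对格式关键子空间做投影抑制。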

[NLP-89] Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity ICLR2026

【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)中广泛使用余弦相似度(cosine similarity)作为层相关性(layer relevance)评估指标时存在的根本性问题,即余弦相似度无法准确反映移除某一层后模型实际性能的下降,可能导致对Transformer内部机制的误导性解释。解决方案的关键在于提出一个更稳健的替代指标——直接测量移除层后模型准确率的实际下降(actual drop in model accuracy resulting from the removal of a layer)。尽管该指标计算成本较高,但它能提供层重要性的准确量化,从而支持更有效的剪枝策略和轻量级模型构建。

链接: https://arxiv.org/abs/2605.14075
作者: Cristian Hinostroza,Rodrigo Toro Icarte,Christ Devia,Andres Carvallo De Ferari,Eugenio Herrera-Berg,Denis Parra,Jorge F Silva
机构: Pontificia Universidad Católica de Chile (智利天主教大学); National Center for Artificial Intelligence (CENIA) (国家人工智能中心); Universidad de Chile (智利大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Published at ICLR 2026

点击查看摘要

Abstract:Large language models (LLMs) have revolutionized natural language processing. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model's performance. On the other hand, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer's internal mechanisms. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance.
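论文提出的度量(移除某层后的实际精度下降)计算流程很直接:逐层消融、重新评估、与基线求差。下面用一个玩具分层模型演示该流程;其中"冗余层"(近似恒等)与"关键层"(搬运标签信号的置换)均为人为构造,真实场景中应对 Transformer 层做同样的消融:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 1] > 0).astype(int)                 # 标签只依赖特征 1

W_redundant = np.eye(8) + 0.01 * rng.normal(size=(8, 8))  # 近似恒等的"冗余层"
P_swap = np.eye(8)
P_swap[[0, 1]] = P_swap[[1, 0]]               # 把特征 1 换到第 0 列的"关键层"
layers = [lambda h: h @ W_redundant, lambda h: h @ P_swap]

def accuracy(skip=None):
    """跑完所有层(可选跳过第 skip 层),用最终第 0 列的符号做预测。"""
    h = X
    for i, layer in enumerate(layers):
        if i != skip:
            h = layer(h)
    return float(((h[:, 0] > 0).astype(int) == y).mean())

base = accuracy()
drops = [base - accuracy(skip=i) for i in range(len(layers))]
for i, dr in enumerate(drops):
    print(f"layer {i}: accuracy drop = {dr:+.3f}")
```

这里移除冗余层几乎不影响精度,而移除关键层会让精度跌到随机水平;两层与原输出的余弦相似度却未必能反映这一差别,这正是论文批评的失配。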

[NLP-90] Distribution Corrected Offline Data Distillation for Large Language Models

【速读】: 该论文试图解决从强语言模型向小模型蒸馏推理轨迹时,现有方法面临的根本性权衡:离线蒸馏利用教师生成的高质量轨迹提供样本高效的监督,但会导致师生分布漂移(distributional drift),即学生模型在训练时条件于教师生成的前缀,而在推理时自回归于自身生成的前缀,造成长推理轨迹上的累积误差;而在线策略(on-policy)或自蒸馏方法虽能匹配学生推理时的分布,却需要昂贵的在线采样且早期生成低质量轨迹。解决方案的关键在于提出一种原则性的离线推理蒸馏框架(principled offline reasoning distillation framework),该框架在保留离线教师生成数据的效率与监督质量的同时,通过自适应地强调与学生在线策略分布(on-policy distribution)更一致的教师监督来纠正师生分布漂移,从而在不进行在线交互的情况下,利用分布校正感知训练(distribution-correction-aware training)显著提升离线推理蒸馏的准确性和稳定性。

链接: https://arxiv.org/abs/2605.14071
作者: Yumeng Zhang,Zhengbang Yang,Yevin Nikhel Goonatilake,Zhuangdi Zhu
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Distilling reasoning traces from strong large language models into smaller ones is a promising route to improve intelligence in resource-constrained settings. Existing approaches face a fundamental trade-off: offline distillation from teacher-generated traces provides high-quality, sample-efficient supervision but suffers from distributional drift: during training, the student model conditions on teacher-generated prefixes, whereas during inference the student autoregresses on self-generated prefixes, leading to compounding errors over long reasoning trajectories. Meanwhile, on-policy or self-distillation methods better match the student’s inference-time distribution, but require costly online sampling and often produce low-quality traces in early training. We propose a principled offline reasoning distillation framework that preserves the efficiency and supervision quality of offline teacher-generated data while correcting teacher-student distribution drift. It adaptively emphasizes teacher supervision that is better aligned with the student’s on-policy distribution. Evaluations on mathematical reasoning benchmarks of GSM8K, MATH, MATH500, and harder held-out competition-style tasks, including AMC, AIME, and OlympiadBench, show that our method improves reasoning accuracy over prior offline distillation algorithms and yields more stable reasoning traces while preserving instruction-following capabilities. Our work shows that lightweight, distribution-correction-aware training can substantially strengthen offline reasoning distillation without online rollouts.

[NLP-91] Know When To Fold Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

【速读】: 该论文试图解决在使用大型语言模型(LLMs)进行合成数据生成的后训练管线中,传统方法先生成完整输出再进行质量过滤,导致大量token浪费在最终被丢弃的样本上的低效问题。解决方案的关键在于提出了多阶段飞行中拒绝(Multi-Stage In-Flight Rejection, MSIFR)框架,这是一种轻量级、无需训练的方法,通过将生成过程分解为多个顺序阶段,并在每个中间检查点(intermediate checkpoint)应用基于规则的快速验证器(rule-based validators)来检测算术不一致、幻觉模式和格式违规等缺陷,从而在生成未完成前就早期终止低质量的生成轨迹,实现早期拒绝(early rejection)。该框架将飞行中拒绝形式化为一个顺序决策过程,证明了任何非平凡的丢弃策略都能减少期望token消耗,且拒绝发生得越早,阶段节省越大;同时通过条件效用估计的鞅(martingale)性质,确保了早期拒绝不会对保留样本的期望效用产生偏差。实验表明,MSIFR在五个指令微调模型和七个推理基准上,作为独立方法可将token消耗降低11%-77%,与早期退出方法结合时最高达78.2%,同时保持或提升了评估准确率。

链接: https://arxiv.org/abs/2605.14062
作者: Anjir Ahmed Chowdhury,Syed Zawad,Feng Yan
机构: University of Houston (休斯顿大学); IBM Research (IBM研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages, 4 figures, 7 tables

点击查看摘要

Abstract:While synthetic data generation with large language models (LLMs) is widely used in post-training pipelines, existing approaches typically generate full outputs before applying quality filters, leading to substantial token waste on samples that are ultimately discarded. To address this, we propose Multi-Stage In-Flight Rejection (MSIFR), a lightweight, training-free framework that detects and terminates low-quality generation trajectories at intermediate checkpoints before they reach full completion. MSIFR decomposes the generation process into sequential stages and applies fast rule-based validators to identify arithmetic inconsistencies, hallucination patterns, and formatting violations, enabling early rejection of faulty samples. We formalize in-flight rejection as a sequential decision process and show that any non-trivial discard policy reduces expected token consumption, with stage-wise savings increasing when rejection occurs earlier in the generation pipeline. We further demonstrate that conditional utility estimates form a martingale, ensuring that early, in-flight rejection does not bias the expected utility of retained samples. Across five instruction-tuned models and seven reasoning benchmarks, MSIFR reduces token consumption by 11%-77% as a standalone method, and up to 78.2% when combined with early-exit methods, while preserving or improving evaluation accuracy. These results confirm that MSIFR provides a practical mechanism for improving the efficiency of LLM-based synthetic data generation without additional training or architectural changes.
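MSIFR 的阶段化早期拒绝可用如下草图说明:生成被切分为多个阶段,每个阶段后运行廉价的规则校验器(这里只示意算术一致性与格式检查两类),一旦失败立即终止,剩余阶段的 token 不再生成。校验器与阶段划分均为假设性示例,并非论文的完整校验器集合:

```python
import re

def arithmetic_ok(text):
    """校验文本中形如 "a + b = c" 的算式是否自洽。"""
    for a, b, c in re.findall(r"(\d+)\s*\+\s*(\d+)\s*=\s*(\d+)", text):
        if int(a) + int(b) != int(c):
            return False
    return True

def format_ok(text):
    return "FORBIDDEN" not in text   # 占位的格式校验

VALIDATORS = [arithmetic_ok, format_ok]

def generate_with_rejection(stages):
    """stages: 按顺序产出的文本片段;返回 (接受的文本或 None, 实际消耗的阶段数)。"""
    so_far = ""
    for used, chunk in enumerate(stages, start=1):
        so_far += chunk
        if not all(v(so_far) for v in VALIDATORS):
            return None, used        # 中途拒绝:剩余阶段不再生成
    return so_far, len(stages)

good = ["First, 2 + 3 = 5. ", "Then 5 + 5 = 10. ", "Answer: 10."]
bad  = ["First, 2 + 3 = 6. ", "Then ...", "Answer: ???"]
print(generate_with_rejection(good))   # 三个阶段全部通过
print(generate_with_rejection(bad))    # 第一阶段即被拒绝
```

越早拒绝,节省的阶段越多,这对应论文中"拒绝越早、阶段节省越大"的分析;而保留样本的效用不受早期拒绝影响,则由条件效用估计的鞅性质保证。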

[NLP-92] Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents ACL2026

【速读】: 该论文旨在解决现有对话系统多为用户驱动、被动响应用户请求,而在关键的真实世界场景中(如美国最高法院口头辩论)需要对话代理主动探询信息以实现自身目标的问题。解决方案的关键在于提出了一个双重分层强化学习(Dual Hierarchical Reinforcement Learning)框架,该框架包含两个具有各自策略的协同强化学习(Reinforcement Learning, RL)智能体,分别负责战略层面的对话管理和细粒度的话语生成,通过学习何时以及如何提出探究性问题,模拟司法提问模式,系统性地揭示关键信息以实现法律目标。

链接: https://arxiv.org/abs/2605.14057
作者: Xubo Lin,Zezhii Deng,Shihao Wang,Grace Hui Yang,Yang Deng
机构: Georgetown University (乔治城大学); Singapore Management University (新加坡管理大学)
类目: Computation and Language (cs.CL)
备注: Accepted in ACL 2026 as Findings

点击查看摘要

Abstract:Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce Inquisitive Conversational Agents (ICAs) and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.

[NLP-93] PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

【速读】: 该论文旨在解决现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法(如LoRA和Prefix Tuning)在多任务学习(Multi-task Learning)中的局限性:LoRA及其变体侧重于模型权重适应而忽略了提示调优(Prompt Tuning)的重要性,而Prefix Tuning虽优化提示但架构简单,限制了多任务适应能力。解决方案的关键在于提出了一种名为PEML(Parameter-Efficient Multi-task Learning)的框架,该框架通过神经架构工程(neural architecture engineering)方法联合优化连续提示(continuous prompts)并进行模型权重的低秩适应(low-rank adaptation),从而实现提示优化与模型适应的协同优化,最终在多个基准测试上取得了高达6.67%的平均准确率提升。

链接: https://arxiv.org/abs/2605.14055
作者: Anjir Ahmed Chowdhury,Syed Zawad,Xiaolong Ma,Xu Dong,Feng Yan
机构: University of Houston (休斯顿大学); IBM Research (IBM研究院); Argonne National Laboratory (阿贡国家实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 26 pages, 8 figures, 18 Tables

点击查看摘要

Abstract:Parameter-Efficient Fine-Tuning (PEFT) is widely used for adapting Large Language Models (LLMs) for various tasks. Recently, there has been an increasing demand for fine-tuning a single LLM for multiple tasks because it requires overall less data for fine-tuning thanks to the common features shared among tasks. More importantly, LLMs are resource demanding, and deploying a single model for multiple tasks facilitates resource consolidation and consumes significantly less resources compared to deploying an individual large model for each task. Existing PEFT methods like LoRA and Prefix Tuning are designed to adapt LLMs to a specific task. LoRA and its variations focus on aligning the model itself for tasks, overlooking the importance of prompt tuning in multi-task learning, while Prefix Tuning only adopts a simple architecture to optimize prompts, which limits the adaptation capabilities for multi-task learning. To enable efficient fine-tuning for multi-task learning, it is important to co-optimize prompt optimization and model adaptation. In this work, we propose Parameter-Efficient Multi-task Learning (PEML), which employs a neural architecture engineering method for optimizing the continuous prompts while also performing low-rank adaptation of model weights. We prototype PEML by creating an automated framework for optimizing the continuous prompts and adapting model weights. We evaluate PEML against state-of-the-art multi-task learning methods MTL-LoRA, MultiLoRa, C-Poly, and MoE, on the GLUE, SuperGLUE, Massive Multitask Language Understanding, and commonsense reasoning benchmarks. The evaluation results present an average accuracy improvement of up to 6.67%, with individual tasks showing peak gains of up to 10.75%.

[NLP-94] Derivation Prompting: A Logic-Based Method for Improving Retrieval-Augmented Generation

【速读】: 该论文旨在解决大型语言模型(LLMs)在知识密集型、特定领域的问答(Question Answering)任务中出现的幻觉(hallucinations)和错误推理(erroneous reasoning)问题。解决方案的关键在于提出了一种名为“推导提示”(Derivation Prompting)的新型提示技术,该技术用于检索增强生成(Retrieval-Augmented Generation, RAG)框架的生成步骤。方法受逻辑推导的启发,通过系统性地应用预定义规则,从初始假设出发推导出结论,并构建一个可解释的推导树(derivation tree),从而增强对生成过程的控制,显著降低了不可接受答案的产生。

链接: https://arxiv.org/abs/2605.14053
作者: Ignacio Sastre,Guillermo Moncecchi,Aiala Rosá
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The application of Large Language Models to Question Answering has shown great promise, but important challenges such as hallucinations and erroneous reasoning arise when using these models, particularly in knowledge-intensive, domain-specific tasks. To address these issues, we introduce Derivation Prompting, a novel prompting technique for the generation step of the Retrieval-Augmented Generation framework. Inspired by logic derivations, this method involves deriving conclusions from initial hypotheses through the systematic application of predefined rules. It constructs a derivation tree that is interpretable and adds control over the generation process. We applied this method in a specific case study, significantly reducing unacceptable answers compared to traditional RAG and long-context window methods.
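推导提示构建推导树的过程可以用一个极简的前向链推理来示意:从初始假设出发,反复套用预定义规则推出新结论,并记录每个结论的推导依据,从而形成可解释、可追溯的推导树。规则与事实均为虚构示例(法律场景仅作说明),实际系统中由 LLM 在检索到的上下文上执行类似推导:

```python
# 规则形如 (前提集合, 结论);推导树用 dict 记录:None 表示初始假设(叶子)
RULES = [
    ({"is_employee", "worked_overtime"}, "entitled_overtime_pay"),
    ({"entitled_overtime_pay", "pay_withheld"}, "may_file_claim"),
]

def derive(facts):
    """前向链推理:不断应用可触发的规则,直到没有新结论产生。"""
    tree = {f: None for f in facts}
    changed = True
    while changed:
        changed = False
        for premises, conclusion in RULES:
            if conclusion not in tree and premises <= set(tree):
                tree[conclusion] = sorted(premises)   # 记录推导依据
                changed = True
    return tree

tree = derive({"is_employee", "worked_overtime", "pay_withheld"})
for fact, parents in tree.items():
    print(fact, "<-", parents or "初始假设")
```

每个非叶子结论都能沿 parents 回溯到初始假设,这就是论文所说的"可解释并增强对生成过程控制"的推导树结构。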

[NLP-95] Bridging Legal Interpretation and Formal Logic: Faithfulness Assumption and the Future of AI Legal Reasoning

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在法律实践中因系统性地得出超出源文本支持的推断并呈现带有预设假设的结论,而导致的幻觉与可信度缺失问题,根本挑战在于如何在保持AI辅助法律推理能力的同时,满足法律工作对严格性与可问责性的要求。解决方案的关键在于采用神经符号方法(neuro-symbolic approach),将大语言模型的表达力与形式验证(formal verification)的严谨性相结合,从而在减少人工验证负担的同时,确保推理结果的可信度与可追溯性。

链接: https://arxiv.org/abs/2605.14049
作者: Olivia Peiyu Wang,Leilani H. Gilpin
机构: University of California, Santa Cruz (加利福尼亚大学圣克鲁兹分校)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 2 pages abstract accepted by Bloomberg LSLLAI 2026 Symposium

点击查看摘要

Abstract:The growing adoption of large language models in legal practice brings both significant promise and serious risk. Legal professionals stand to benefit from AI that can reason over contracts, draft documents, and analyze sources at scale, yet the high-stakes nature of legal work demands a level of rigor that current AI systems do not provide. The central problem is not simply that LLMs hallucinate facts and references; it is that they systematically draw inferences that go beyond what the source text actually supports, presenting assumption-laden conclusions as if they were logically grounded. This proposal presents a neuro-symbolic approach to legal AI that combines the expressive power of large language models with the rigor of formal verification, aiming to make AI-assisted legal reasoning both capable and trustworthy, thus reducing the burden of manual verification without sacrificing the accountability that legal practice demands.

[NLP-96] Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

【速读】: 该论文旨在解决当前多模态物理评估(multimodal-physics evaluation)管线中三类未被察觉的构造性缺陷,这些缺陷系统地扭曲了视觉语言推理(vision-language reasoning)能力的真实测量:训练-评估污染(train-eval contamination)、翻译漂移(translation drift)和选择题饱和(MCQ saturation)。具体而言,现有公开训练池(如UGPhysics-Train, SciInstruct, MMK12)在单阶段5-gram Jaccard审计中虽显示零命中,但通过三阶段审计(Jaccard → mxbai-embed-large余弦 → Haiku-4.5 LLM评判)暴露了大量近似重复与释义候选;翻译漂移在爱沙尼亚-英语对照奥林匹克物理题中造成高达17个百分点的性能差异;而MCQ格式与开放题格式在相同权重下存在46个百分点的性能梯度。解决方案的关键在于:通过端到端审计定量揭示这些污染源,并发布四个针对性的人工制品——PhysCorp-A(经三阶段审计的多模态语料库)、PhysR1Corp(闭形式强化学习池)、PhysOlym-A(含99.8%新颖来源的保留奥林匹克评估集及双语子集)以及基于Qwen3-VL-8B-Thinking冷启动的参考训练配方Physics-R1(结合GSPO与DAPO优化)。该方案通过对齐审计与训练数据、引入严格新颖性控制的评估以及强化学习配方,在PhysOlym-A、PhysReason、OlympiadBench-Physics和PhyX MCQ上分别取得了+18.3、+15.7、+6.9和+4.1个百分点的提升,部分指标超越更大模型,从而有效修复了原有评估失真的问题。

链接: https://arxiv.org/abs/2605.14040
作者: Shan Yang
机构: Independent Researcher(独立研究员)
类目: Computation and Language (cs.CL)
备注: 10 pages, 3 tables. Project page: this https URL

点击查看摘要

Abstract:We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard → mxbai-embed-large cosine → Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 → 26.3 ± 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 → 39.6 ± 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 ± 1.5), and +4.1 pp on PhyX MCQ (77.8 ± 0.3).
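三阶段污染审计的第一阶段(5-gram Jaccard)可以几行代码实现;后两阶段(嵌入余弦、LLM 评判)依赖外部模型,此处从略。示例文本为虚构,仅演示为何单靠 5-gram 重叠会漏掉释义级污染(第二对文本语义接近但 5-gram 重叠为零):

```python
def ngrams(text, n=5):
    """按空白切词后取所有 n-gram 的集合。"""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def jaccard5(a, b):
    ga, gb = ngrams(a), ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

train = "a ball is thrown vertically upward with an initial speed of 20 m/s"
test_near = "a ball is thrown vertically upward with initial velocity 20 m/s find"
test_far = "compute the magnetic flux through a rotating square loop"

print(round(jaccard5(train, test_near), 3))  # 近重复:非零 Jaccard,能被第一阶段捕获
print(jaccard5(train, test_far))             # 无 5-gram 重叠
```

论文发现大量样本恰好落在"第一阶段为零、后两阶段报警"的区间,这正是仅做单阶段 n-gram 审计的公开训练池看似干净的原因。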

[NLP-97] Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

【速读】: 该论文旨在解决现代测试时计算(test-time compute)和智能体范式(agentic paradigms)下,语言模型处理超长序列时,Transformer架构中Key-Value缓存(KV cache)内存占用和带宽瓶颈导致的低效文本生成问题。解决方案的关键是提出自剪枝键-值注意力(Self-Pruned Key-Value Attention, SP-KV)机制,其核心在于引入一个轻量级效用预测器(utility predictor),为每个键值对(Key-Value pair)评分,并采用动态稀疏化策略:最近的KVs通过局部窗口(local window)始终可用,而较旧的KVs仅在其预测效用超过设定阈值时才被写入缓存并参与全局注意力(global attention)。该预测器与大语言模型(LLM)通过下一token预测损失(next-token prediction loss)联合进行端到端训练,且从预训练LLM检查点(checkpoint)适配而来,从而实现对输入的自适应稀疏化,通常将KV缓存大小减少3到10倍(长序列压缩比更高),显著提升内存使用和解码速度,同时几乎不损害验证损失或下游任务性能。此外,该方法揭示的结构化层和头特异性稀疏模式还可指导混合局部-全局注意力架构(hybrid local-global attention architectures)的设计。

链接: https://arxiv.org/abs/2605.14037
作者: Gergely Szilvasy(1),Manuel Faysse(1 and 2),Maria Lomeli(1),Matthijs Douze(1),Pierre-Emmanuel Mazaré(1),Loïc Cabannes(1),Wen-tau Yih(1),Hervé Jégou(1) ((1) Meta FAIR, (2) MICS, CentraleSupélec)
机构: Meta
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 28 pages, 8 figures, 8 tables

点击查看摘要

Abstract:Under modern test-time compute and agentic paradigms, language models process ever-longer sequences. Efficient text generation with transformer architectures is increasingly constrained by the Key-Value cache memory footprint and bandwidth. To address this limitation, we introduce Self-Pruned Key-Value Attention (SP-KV), a mechanism designed to predict future KV utility in order to reduce the size of the long-term KV cache. This strategy operates at a fine granularity: a lightweight utility predictor scores each key-value pair, and while recent KVs are always available via a local window, older pairs are written in the cache and used in global attention only if their predicted utility surpasses a given threshold. The LLM and the utility predictor are trained jointly end-to-end exclusively through next-token prediction loss, and are adapted from pretrained LLM checkpoints. Rather than enforcing a fixed compression ratio, SP-KV performs dynamic sparsification: the mechanism adapts to the input and typically reduces the KV cache size by a factor of 3 to 10×, longer sequences often being more compressible. This leads to vast improvements in memory usage and decoding speed, with little to no degradation of validation loss or performance on a broad set of downstream tasks. Beyond serving as an effective KV-cache reduction mechanism, our method reveals structured layer- and head-specific sparsity patterns that we can use to guide the design of hybrid local-global attention architectures.
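SP-KV 的核心筛选逻辑(局部窗口恒可见、较旧 KV 由效用阈值门控)可用如下示意代码表达(窗口大小与阈值为假设值;真实系统中效用分数由与 LLM 联合训练的轻量预测器给出):

```python
def spkv_visible(utilities, step, window=4, threshold=0.5):
    """Indices of KV pairs visible to global attention at `step`:
    the local window is always kept; older entries survive only if
    their predicted utility clears the threshold."""
    keep = []
    for i in range(step + 1):
        if i > step - window:            # recent: always in the local window
            keep.append(i)
        elif utilities[i] > threshold:   # old: gated by the utility predictor
            keep.append(i)
    return keep
```

保留集合随输入自适应变化,这正是摘要中"动态稀疏化而非固定压缩比"的含义。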

[NLP-98] Enhanced and Efficient Reasoning in Large Language Models

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在生成流畅文本时缺乏对内容可信度的原理性保障的问题,即无法基于可解释的逻辑推理来证明输出内容的正确性。解决方案的关键在于提出一种高效且实用的两阶段方法:第一阶段对输入数据进行预处理,将其重新编码为“一元关系集成编码”(Unary Relational Integracode),该编码显式地捕捉文本中描述的对象间关系,从而将原本分布式的信息集中呈现;第二阶段采用标准但可精简的机器学习过程,同时学习预测这些显式关系。该重新编码具有一个关键特性:它使学习训练数据中世界核心关系规则子集的任务在多项式时间内可学习(多项式复杂度取决于规则本身),从而在单次或多次调用学习分类器时,能够通过鲁棒逻辑(Robust Logic)系统进行可靠的逻辑链推理,最终在保证计算可行性的前提下大幅提升模型输出的可信度。

链接: https://arxiv.org/abs/2605.14036
作者: Leslie G. Valiant
机构: John A. Paulson School of Engineering and Applied Sciences, Harvard University (哈佛大学约翰·A·保尔森工程与应用科学学院)
类目: Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In current Large Language Models we can trust the production of smoothly flowing prose on the basis of the principles of machine learning. However, there is no comparably principled basis to justify trust in the content of the text produced. It appears to be conventional wisdom that addressing this issue by adding more principled reasoning is not computationally affordable. Here we propose a principled method of reasoning that is efficient enough to be practical for large language models. Further, the method allows the retention of much of the currently used software and hardware base. Our method for improving the functioning of large language models consists of a first stage of preprocessing that recodes the data to a Unary Relational Integracode that is more explicit about the relationships among the objects described in the text, followed as a second stage by a standard but possibly streamlined machine learning process that then also learns to predict these relationships. The method may be viewed as realizing a world model and applying beyond natural language, to vision and actions, for example, where the multiple properties of an object referred to in an input are brought together explicitly, rather than remaining distributed in the various references to it in the input. We articulate its advantages in terms of Robust Logic, a system for performing principled chaining on learned, and hence uncertain, information. We show that this recoding has the surprising and fortuitous property that, while succinct, it makes the task of learning a core subset of relational rules that hold in the world described in the training data polynomial time learnable in a defined sense, the polynomial depending on the complexity of the rule. This gives support for sound reasoning within each single call of the learned classifier as well as between multiple calls. 

[NLP-99] From Descriptive to Prescriptive: Uncover the Social Value Alignment of LLM-based Agents

【速读】: 该论文旨在解决基于大语言模型(LLM)的智能体在自我认知(self-cognition)、困境决策(dilemma decision)以及自我情感(self-emotions)方面存在的缺陷,使其难以与人类社交价值观实现强对齐。解决方案的关键在于提出一个新颖的基于价值(value-based)的框架:利用图增强检索生成(GraphRAG)将抽象原则转化为基于价值的指令,并在特定对话上下文中通过检索合适的指令来引导智能体表现出期望行为;同时,借助马斯洛需求层次理论(Maslow’s Hierarchy of Needs)和普拉奇克情绪轮(Plutchik’s Wheel of Emotion)两种经典理论来定义与评估期望行为比率,从而为AI系统中自我情感的出现提供基础。

链接: https://arxiv.org/abs/2605.14034
作者: Jinxian Qu,Qingqing Gu,Teng Chen,Luo Ji
机构: Geely AI Lab (吉利人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted by CogSci 2026

点击查看摘要

Abstract:Wide applications of LLM-based agents require strong alignment with human social values. However, current works still exhibit deficiencies in self-cognition and dilemma decision, as well as self-emotions. To remedy this, we propose a novel value-based framework that employs GraphRAG to convert principles into value-based instructions and steer the agent to behave as expected by retrieving the suitable instruction upon a specific conversation context. To evaluate the ratio of expected behaviors, we define the expected behaviors from two famous theories, Maslow’s Hierarchy of Needs and Plutchik’s Wheel of Emotion. By experimenting with our method on the benchmark of DAILYDILEMMAS, our method exhibits significant performance gains compared to prompt-based baselines, including ECoT, Plan-and-Solve, and Metacognitive prompting. Our method provides a basis for the emergence of self-emotion in AI systems.

[NLP-100] Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

【速读】: 该论文试图解决基于模型的推测解码(speculative decoding)中因起草模型(drafter)与目标模型(target model)分布近似不完美而引入的机制级安全漏洞,具体表现为攻击者可通过微小扰动在保持目标模型输出语义质量的同时,大幅降低每步验证中存活的平均接受长度(average accepted length, τ),从而崩溃推理加速效果。解决方案的关键在于提出了名为Mistletoe的隐蔽加速崩溃攻击,它直接针对推测解码的接受机制,联合优化一个降低起草-目标一致性的退化目标(degradation objective)与一个保持目标模型输出分布的语义保持目标(semantic-preservation objective),并通过引入零空间投影机制(null-space projection)将退化梯度(degradation gradient)投影到局部语义保持方向的零空间中,从而在抑制草案令牌接受性的同时最小化语义漂移,最终达到在输出质量和困惑度(perplexity)不受显著影响的前提下有效降低τ、崩溃加速比和平均令牌吞吐量的攻击效果。

链接: https://arxiv.org/abs/2605.14005
作者: Shuoyang Sun,Chang Da,Hao Fang,Kuofeng Gao,Xinhao Zhong,Yi Sun,Fan Mo,Shu-Tao Xia,Bin Chen
机构: Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)); South China University of Technology(华南理工大学); Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院); Huawei Technology(华为技术)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length \tau , i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter-target mismatch creates a hidden attack surface where small perturbations can preserve the target model’s visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter-target agreement and a semantic-preservation objective that constrains the target model’s output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces average accepted length \tau , collapses speedup, and lowers averaged token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
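Mistletoe 的零空间投影机制(将退化梯度中沿语义保持方向的分量移除)可用如下数值示意(向量维度与取值均为说明用途,非论文实现):

```python
import numpy as np

def project_out(grad, direction):
    """Remove from `grad` its component along `direction`, i.e. project
    the degradation gradient into the null space of the locally
    semantic-preserving direction."""
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    g = np.asarray(grad, dtype=float)
    return g - np.dot(g, d) * d
```

投影结果与语义保持方向正交,因此沿该方向的语义漂移(一阶近似下)为零,而垂直分量仍可压低草案令牌的接受率。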

[NLP-101] HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts

【速读】: 该论文试图解决学习无关压缩稀疏混合专家层(Sparse Mixture-of-Experts, MoE)时遇到的一个根本性障碍:三个专家可能两两兼容,但合并时却形成不可约循环(irreducible cycle),导致任何基于成对信号的评分方法在结构上无法识别哪些三元组是可联合合并的。解决方案的关键在于:作者将该障碍精确建模为定义在专家为顶点的2-复形上的调和核(harmonic kernel of the simplicial Laplacian),其中边携带KL散度合并障碍,面携带三元组障碍;通过霍奇分解(Hodge decomposition)将边障碍信号分解,从而隔离出调和核。基于这一诊断,他们提出HodgeCover算法,贪婪地覆盖调和关键边和三元组关键三角,并设计了与权重剪枝结合的混合变体。该方法在激进专家压缩场景下取得了领先性能,揭示了暴露所学MoE结构的调和核能够改变压缩策略在最关键工况下的胜负格局。

链接: https://arxiv.org/abs/2605.13997
作者: Tao Zhong,Dongzhe Zheng,Christine Allen-Blanchette
机构: Princeton University(普林斯顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages, 8 figures

点击查看摘要

Abstract:Sparse Mixture-of-Experts (MoE) layers route tokens through a handful of experts, and learning-free compression of these layers reduces inference cost without retraining. A subtle obstruction blocks every existing compressor in this family: three experts can each be pairwise compatible yet form an irreducible cycle when merged together, so any score that ranks experts on pairwise signals is structurally blind to which triples are jointly mergeable. We show the obstruction is a precise mathematical object, the harmonic kernel of the simplicial Laplacian on a 2-complex whose vertices are experts, whose edges carry KL merge barriers, and whose faces carry triplet barriers; Hodge-decomposing the edge-barrier signal isolates the kernel exactly. We turn the diagnostic into a selection objective: HodgeCover greedily covers the harmonic-critical edges and triplet-critical triangles, and a hybrid variant of HodgeCover pairs it with off-the-shelf weight pruning on survivors. On three open-weight Sparse MoE backbones under aggressive expert reduction, HodgeCover matches state-of-the-art learning-free baselines on the expert-reduction axis, leads on the aggressive-compression frontier of the hybrid axis, and uniquely balances retained mass across all four Hodge components. These results show that exposing the harmonic kernel of a learned MoE structure changes which compressor wins at the regime that matters most.
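调和核的含义可在一个最小 2-复形上验证:空心三角形(三条边两两相容但构成不可约循环)的 Hodge 1-拉普拉斯算子有一维调和核;填入三角面后循环被"覆盖",调和核消失。以下为示意代码(边与面的定向为假设约定):

```python
import numpy as np

# Vertices {0, 1, 2}; oriented edges e01, e12, e02.
B1 = np.array([[-1,  0, -1],
               [ 1, -1,  0],
               [ 0,  1,  1]])          # vertex-edge incidence
L1_hollow = B1.T @ B1                   # 1-Laplacian with no filled face

# Filling the triangle (0,1,2) adds its boundary e01 + e12 - e02.
B2 = np.array([[1], [1], [-1]])         # edge-triangle incidence
L1_filled = L1_hollow + B2 @ B2.T       # full Hodge 1-Laplacian

def harmonic_dim(L):
    """Dimension of the harmonic kernel = number of irreducible cycles."""
    return L.shape[0] - np.linalg.matrix_rank(L)
```

这正是论文中"成对信号对可合并三元组结构性失明"的代数体现:成对障碍(边)相同,但有无 2-胞腔决定调和核是否为零。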

[NLP-102] VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

【速读】: 该论文旨在解决两个核心问题:一是缺乏以西班牙语(尤其聚焦拉丁美洲)为原生语言、具备原生工具调用能力(通过Model Context Protocol, MCP)的网络安全大语言模型(LLM);二是探究小规模模型(42M参数)在工具选择能力上的瓶颈是否源于模型容量限制,而非训练数据特征。解决方案的关键包括:(1)构建了VectraYX-Sec-ES语料库,这是一个170M token的西班牙语网络安全语料,通过低成本八虚拟机流水线(约25美元)从对话、网络安全和攻击性安全工具三个领域收集,并设计课程学习(curriculum with replay)实现连续预训练损失单调下降(从9.80降至2.16);(2)采用42M参数的decoder-only Transformer架构,集成GQA、QK-Norm、RMSNorm、SwiGLU、RoPE、z-loss以及16,384 token字节回退BPE分词器,并在SFT阶段加入6,327条工具使用轨迹数据,使对话门控(conversational gate)达到0.78±0.05;(3)通过语料消融实验发现纳米尺度下的loss-vs-register反转现象,以及通过LoRA研究证明工具选择基线(B4)的低值(0.000)是语料密度伪影而非容量限制:增加工具密集语料(2,801个示例)后,42M模型B4提升至0.145±0.046,260M模型提升至0.445±0.201,从而揭示在极小模型规模下,工具选择能力的关键制约因素是可用的高质量工具相关训练样本密度。

链接: https://arxiv.org/abs/2605.13989
作者: Juan S. Santillana
机构: Globant(格洛班特)
类目: Computation and Language (cs.CL)
备注: 15 pages, 4 figures, preprint

点击查看摘要

Abstract:We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American focus and native tool invocation via the Model Context Protocol (MCP). Four contributions: (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus from an eight-VM pipeline (~25 USD) partitioned into conversational (42M tokens, OpenSubtitles-ES, OASST1), cybersecurity (118M tokens, NVD, Wikipedia-ES, CVE mirror, security blogs), and offensive-security tooling (10M tokens, ExploitDB, HackTricks, OWASP) phases. (ii) Architecture: 42M-parameter Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE, z-loss, and a 16,384-token byte-fallback BPE. (iii) Curriculum with replay: continual pre-training with a replay buffer yields monotonic loss descent (9.80 → 3.17 → 3.00 → 2.16); after SFT on OASST-ES, Alpaca-ES, CVE QA, and 6,327 tool-use traces, the model attains a conversational gate of 0.78±0.05 (N=4 seeds). (iv) Two findings: a bootstrap-corpus ablation reveals a loss-vs-register inversion at nano scale; a LoRA study shows the B4 tool-selection floor of 0.000 is a corpus-density artifact, not a capacity gate – a tool-dense corpus (2,801 examples) raises B4 to 0.145±0.046 on Nano 42M and 0.445±0.201 on a 260M mid-tier. The GGUF artifact is 81 MB (F16), runs at sub-second TTFT on commodity hardware under this http URL, and is to our knowledge the first Spanish-native cybersecurity LLM with end-to-end MCP integration. Corpus recipe, training scripts, GGUF weights, and B1-B5 benchmark are released.

[NLP-103] Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

【速读】: 该论文试图解决扩散语言模型(diffusion language models)在后训练(post-training)阶段采用奖励最大化目标时出现的一种关键失效模式——“轨迹锁定”(trajectory locking),即基于奖励驱动的采样更新会使概率质量过度集中于狭窄的去噪路径(denoising paths),从而在重复采样时减少对其它正确解决方案的覆盖。解决方案之关键在于提出TraFL(Trajectory Flow baLancing),一种轨迹平衡(trajectory-balance)目标,它将策略(policy)训练引导至一个以冻结参考模型(frozen reference model)为锚点的、经奖励倾斜的目标分布(reward-tilted target distribution),并通过一个兼容扩散过程的序列级替代损失(diffusion-compatible sequence-level surrogate)和一个学习的提示相关归一化项(learned prompt-dependent normalization)使其在扩散语言模型中实用化。实验证明,TraFL在数学推理和代码生成基准测试中,是唯一在所有评估长度设置下均优于基础模型的后训练方法,且收益随采样预算增加而持续。

链接: https://arxiv.org/abs/2605.13935
作者: Saba Ahmadi,Prasanna Parthasarathi,Yufei Cui
机构: Noah’s Ark Lab (诺亚方舟实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Diffusion language models are a promising alternative to autoregressive models, yet post-training methods for them largely adapt reward-maximizing objectives. We identify a central failure mode in this setting we call trajectory locking: sampled reward-driven updates over-concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. To address this, we propose TraFL (Trajectory Flow baLancing), a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. We make this practical for diffusion language models with a diffusion-compatible sequence-level surrogate and a learned prompt-dependent normalization. Across mathematical reasoning and code generation benchmarks, TraFL is the only evaluated post-training method that improves over the base model in every benchmark-length setting, with gains that persist as the sampling budget increases. The improvements transfer to held-out evaluations: TraFL stays above the base model on Minerva Math and is the strongest method on every LiveCodeBench difficulty split.
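轨迹平衡目标的标量形式可示意如下:把策略流与"冻结参考模型 × 奖励倾斜项"定义的目标分布对齐。各对数概率与 β 均为假设输入,此处仅演示标量版损失,而非论文中兼容扩散过程的序列级替代损失:

```python
def tb_loss(log_z, logp_policy, logp_ref, reward, beta=1.0):
    """Squared mismatch between the policy flow and the reward-tilted
    target  p_ref(y|x) * exp(beta * r(y)) / Z(x).  Zero iff
    log Z + log p_policy == log p_ref + beta * r."""
    return (log_z + logp_policy - logp_ref - beta * reward) ** 2
```

与奖励最大化不同,该目标在最优点处匹配整个倾斜分布而非单一高奖励模式,这正是缓解"轨迹锁定"的出发点。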

[NLP-104] Merging Methods for Multilingual Knowledge Editing for Large Language Models: An Empirical Odyssey

【速读】: 该论文旨在解决多语言知识编辑(Multilingual Knowledge Editing, MKE)中语言特定编辑之间的相互干扰问题,即在单语言环境下表现良好的“定位-修改”(locate-then-edit)方法在跨语言场景下因编辑冲突而失效。解决方案之关键在于评估向量合并方法(vector merging methods)的有效性,特别是考察共享协方差的向量求和(vector summation with shared covariance)、任务奇异向量合并(Task Singular Vectors for Merging, TSVM)以及权重缩放因子(weight scaling factor)与秩压缩比(rank compression ratio)对多语言干扰的缓解作用,并基于大规模批量编辑(large-scale batch-editing)实验发现:带有共享协方差的向量求和是最可靠的总体策略,而TSVM虽能在部分设置下提升性能但缓解干扰能力有限,同时性能对超参数敏感,较大的缩放因子和相对较低的秩通常能获得更好结果。

链接: https://arxiv.org/abs/2605.13919
作者: Kunil Lee,Ki-Young Shin,Jong-Hyeok Lee,Young-Joo Suh
机构: Department of Computer Science and Engineering, POSTECH (浦项科技大学计算机科学与工程系); Designovel Co., Ltd. (Designovel有限公司); LLSOLLU; Graduate School of Artificial Intelligence, POSTECH (浦项科技大学人工智能研究生院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Multilingual knowledge editing (MKE) remains challenging because language-specific edits interfere with one another, even when locate-then-edit methods work well in monolingual settings. This paper focuses on three issues: the effectiveness of vector merging methods for MKE, the extent to which Task Singular Vectors for Merging (TSVM) can reduce multilingual interference, and the influence of the weight scaling factor and rank compression ratio on performance. We evaluate six merging variants with two popular backbone large language models, two base knowledge editing methods, and 12 languages on the MzsRE benchmark under a large-scale batch-editing setting. Our results show that vector summation with shared covariance is the most reliable overall strategy, whereas simple summation without shared covariance performs poorly. TSVM improves performance in some settings, but its ability to mitigate multilingual interference is limited. We also find that performance is sensitive to both weight scale and rank ratio, with larger-than-default scaling and relatively low rank often yielding better results. These findings clarify the practical strengths and limits of current vector merging methods for MKE and provide guidance for future multilingual knowledge editing research.
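论文评估的合并策略中最基础的一种是向量求和:把各语言编辑后的权重相对基座的增量相加。以下为朴素求和的玩具示意(共享协方差变体还会用协方差统计量对各增量加权/白化,此处未实现;数组与缩放因子均为假设):

```python
import numpy as np

def merge_edits(base, edited_weights, scale=1.0):
    """Plain vector summation: add every language-specific edit delta
    back onto the base weights."""
    delta = sum(w - base for w in edited_weights)
    return base + scale * delta
```

论文发现不带共享协方差的朴素求和表现最差,而论文中的 weight scaling factor 即对应此处的 `scale` 超参数。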

[NLP-105] PREPING: Building Agent Memory without Tasks

【速读】: 该论文旨在解决智能体(agent)在初次进入无任何任务特定经验的新环境时所面临的冷启动鸿沟(cold-start gap)问题,即如何在不依赖离线示范或在线交互的情况下,通过自主生成的合成练习(synthetic practice)构建任务前的过程记忆(pre-task procedural memory)。解决方案的关键在于提出了Preping框架,其核心是一个名为proposer memory的结构化控制状态(structured control state),该状态能指导未来合成任务的生成方向;具体地,一个Proposer基于该状态生成合成任务,一个Solver执行这些任务,而一个Validator则决定哪些轨迹(trajectories)可被插入记忆,同时提供反馈以优化后续提议。该方法的主要优势并非来自合成数据的数量,而是来自提议端(proposer-side)对可行性(feasibility)、冗余性(redundancy)和覆盖度(coverage)的控制,并结合选择性记忆更新(selective memory updates),从而在显著降低部署成本的同时达到了与基于离线或在线经验的方法竞争的性能。

链接: https://arxiv.org/abs/2605.13880
作者: Yumin Choi,Sangwoo Park,Minki Kang,Jinheon Baek,Sung Ju Hwang
机构: KAIST(韩国科学技术院); DeepAuto.ai
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint

点击查看摘要

Abstract:Agent memory is typically constructed either offline from curated demonstrations or online from post-deployment interactions. However, regardless of how it is built, an agent faces a cold-start gap when first introduced to a new environment without any task-specific experience available. In this paper, we study pre-task memory construction: whether an agent can build procedural memory before observing any target-environment tasks, using only self-generated synthetic practice. Yet, synthetic interaction alone is insufficient, as without controlling what to practice and what to store, synthetic tasks become redundant, infeasible, and ultimately uninformative, and memory further degrades quickly due to unfiltered trajectories. To overcome this, we present Preping, a proposer-guided memory construction framework. At its core is proposer memory, a structured control state that shapes future practice. A Proposer generates synthetic tasks conditioned on this state, a Solver executes them, and a Validator determines which trajectories are eligible for memory insertion while also providing feedback to guide future proposals. Experiments on AppWorld, BFCL v3, and MCP-Universe show that Preping substantially improves over a no-memory baseline and achieves performance competitive with strong playbook-based methods built from offline or online experience, with deployment cost 2.99× lower on AppWorld and 2.23× lower on BFCL v3 than online memory construction. Further analyses reveal that the main benefit does not come from synthetic volume alone, but from proposer-side control over feasibility, redundancy, and coverage, combined with selective memory updates.
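Proposer–Solver–Validator 的提议引导循环可抽象为如下控制流(三个组件在真实系统中均由 LLM 驱动,此处用桩函数示意;proposer memory 简化为一个反馈字典):

```python
def build_pretask_memory(proposer, solver, validator, rounds):
    """Pre-task loop: propose a synthetic task from the proposer state,
    solve it, and insert only validated trajectories into memory;
    validator feedback updates the proposer state for later rounds."""
    memory, proposer_state = [], {}
    for _ in range(rounds):
        task = proposer(proposer_state)
        traj = solver(task)
        ok, feedback = validator(task, traj)
        if ok:                              # selective memory update
            memory.append(traj)
        proposer_state[task] = feedback     # steers future proposals
    return memory
```

关键点在于:记忆只接收通过校验的轨迹,而提议端状态持续累积反馈,以控制后续合成任务的可行性、冗余度与覆盖面。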

[NLP-106] A Hormone-inspired Emotion Layer for Transformer language models (HELT)

【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在生成上下文相关且语法正确的文本时,无法像人类情绪认知那样处理和响应情感语境的问题;现有基于离散情绪分类或简单情感分析(sentiment analysis)的方法无法捕捉人类情绪状态的连续、多维特性。解决方案的关键在于提出了一种名为 HormoneT5 的新型架构,该架构通过一个受生物学启发的“激素情绪模块”(Hormone Emotion Block)来模拟人体内分泌系统在情绪处理中的作用。具体实现中,该模块通过六个专门针对每种激素的注意力头(per-hormone attention heads),利用正交初始化的可学习查询、温度缩放注意力机制(temperature-scaled attention mechanisms)和深度输出投影,计算出一组连续且代表不同激素的六维数值。这些激素数值被进一步转化为一个情绪嵌入(emotional embedding),对编码器隐状态进行调制,从而生成情感上适当的响应。此外,论文还提出了一种多目标训练框架,结合序列到序列损失(sequence-to-sequence loss)、带边际惩罚的激素预测损失(hormone prediction loss with margin penalties)以及防止注意力崩溃的多样性正则化(diversity regularization),使得模型在六种激素上的预测准确率达到85%以上,且在对比情感基调下激素差异化范围超过0.85,人类评估也证实了其在情绪恰当性和共情质量上的显著优势。

链接: https://arxiv.org/abs/2605.13858
作者: Eslam Reda,Sara El-Metwally
机构: Mansoura University (曼苏拉大学); Faculty of Computers and Information (计算机与信息学院); Artificial Intelligence Program (人工智能项目); Computer Science Department (计算机科学系)
类目: Neural and Evolutionary Computing (cs.NE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 24 pages, 5 figures

点击查看摘要

Abstract:Large Language Models have demonstrated remarkable capabilities in generating contextually relevant and grammatically correct text. However, they fundamentally lack the ability to process and respond to emotional context in a manner analogous to human emotional cognition. Current approaches to emotion modeling in NLP systems rely primarily on discrete emotion classification or simplistic sentiment analysis, which fail to capture the continuous, multi-dimensional nature of human emotional states. In this paper, we introduce HormoneT5, a novel architecture that augments transformer language models with a biologically-inspired Hormone Emotion Block that simulates the human endocrine system’s role in emotional processing. Our approach computes six continuous hormone-like values through specialized per-hormone attention heads, each with orthogonally initialized learnable queries, temperature-scaled attention mechanisms, and deep output projections. These hormone values are then transformed into an emotional embedding that modulates the encoder hidden states, enabling emotionally-appropriate response generation. We propose a multi-objective training framework combining sequence-to-sequence loss, hormone prediction loss with margin penalties, and diversity regularization to prevent attention collapse. Experimental results on our curated emotion-labeled dataset demonstrate that HormoneT5 achieves 85%+ per-hormone accuracy within a 0.15 tolerance threshold, with hormone differentiation ranges exceeding 0.85 across all six hormones between contrasting emotional tones. Human evaluation studies show significant preference (p < 0.01) for HormoneT5-generated responses in terms of emotional appropriateness and empathetic quality compared to baseline T5 outputs. Our work opens new directions for biologically-grounded affective computing and emotionally intelligent conversational agents.
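单个激素注意力头的前向计算可用 NumPy 示意如下:一个可学习查询以温度缩放的 softmax 在 token 隐状态上做注意力池化,再经 sigmoid 读出一个位于 (0, 1) 的激素数值。查询在真实模型中为正交初始化的可学习参数,此处随机给定;温度与读出方式均为假设:

```python
import numpy as np

def hormone_head(hidden, query, temperature=0.1):
    """One per-hormone head: temperature-scaled attention pooling over
    token states, followed by a sigmoid readout to a hormone level."""
    scores = hidden @ query / (np.sqrt(query.size) * temperature)
    w = np.exp(scores - scores.max())    # numerically stable softmax
    w = w / w.sum()
    pooled = w @ hidden                  # attention-pooled state, shape (d,)
    return 1.0 / (1.0 + np.exp(-pooled.mean()))
```

六个这样的头并行运行即可得到论文所说的六维连续激素向量,再映射为调制编码器隐状态的情绪嵌入。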

[NLP-107] GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration

【速读】: 该论文旨在解决基于提示编排(prompted orchestration)的智能体大语言模型(Agentic LLM)框架中普遍存在的幻觉式路径选择、无限循环以及执行不可复现等问题。解决方案的关键在于引入引擎编排(engine-orchestrated)范式,具体通过GraphBit框架实现:将工作流显式且确定性地定义为有向无环图(DAG),智能体作为类型化函数(typed functions)运行,由基于Rust的引擎控制路由、状态转换和工具调用,从而保证可复现性与可审计性;同时,引擎支持并行分支执行、基于结构化状态谓词的条件控制流、可配置的错误恢复机制,并采用包含临时暂存空间、结构化状态和外部连接器的三层内存架构,隔离各阶段上下文、防止级联上下文膨胀,最终在GAIA基准测试中实现了零框架诱发幻觉、最高准确率(67.6%)及最低延迟(11.9 ms开销)等性能优势。

链接: https://arxiv.org/abs/2605.13848
作者: Yeahia Sarker,Md Rahmat Ullah,Musa Molla,Shafiq Joty
机构: MTSU(中田纳西州立大学); InfinitiBit GmbH(InfinitiBit有限责任公司); Salesforce Research(Salesforce研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 12 pages, 5 figures, 4 tables. Submitted to arXiv, under review

点击查看摘要

Abstract:Agentic LLM frameworks that rely on prompted orchestration, where the model itself determines workflow transitions, often suffer from hallucinated routing, infinite loops, and non-reproducible execution. We introduce GraphBit, an engine-orchestrated framework that defines workflows explicitly and deterministically as a directed acyclic graph (DAG). Unlike prompted orchestration, agents in GraphBit operate as typed functions, while a Rust-based engine governs routing, state transitions, and tool invocation, ensuring reproducibility and auditability. The engine supports parallel branch execution, conditional control flow over structured state predicates, and configurable error recovery. A three-tier memory architecture consisting of ephemeral scratch space, structured state, and external connectors isolates context across stages, preventing cascading context bloat that degrades reasoning in long-running pipelines. Across GAIA benchmark tasks spanning zero-tool, document-augmented, and web-enabled workflows, GraphBit outperforms six existing frameworks, achieving the highest accuracy (67.6 percent), zero framework-induced hallucinations, the lowest latency (11.9 ms overhead), and the highest throughput. Ablation studies demonstrate that each memory tier contributes measurably to performance, with deterministic execution providing the greatest gains on tool-intensive tasks representative of real-world deployments.
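"引擎编排而非模型编排"的确定性路由可用标准库 graphlib 做最小示意:路由由 DAG 的拓扑序决定,智能体只是类型化函数。工作流与节点函数均为假设示例,并非 GraphBit 的 Rust 引擎实现:

```python
from graphlib import TopologicalSorter

def run_workflow(dag, agents, state):
    """Engine-orchestrated execution: the engine walks a fixed
    topological order of the DAG; the model never chooses transitions,
    so runs are reproducible and loop-free by construction."""
    for node in TopologicalSorter(dag).static_order():
        state = agents[node](state)
    return state
```

由于 DAG 无环且执行顺序由引擎固定,幻觉式路径选择与无限循环在构造上即被排除。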

[NLP-108] Controlling Logical Collapse in LLM s via Algebraic Ontology Projection over F2

【速读】: 该论文试图解决的核心问题在于:大语言模型(Large Language Models)内部是否以形式可验证的代数结构(algebraic structure)编码本体论关系(ontological relations),以及如何从模型隐藏状态中显式地提取并验证这种结构。解决方案的关键在于引入代数本体投影(Algebraic Ontology Projection, AOP),该方法将LLM隐藏状态投影到伽罗瓦域F2(Galois Field F2)中,并在里氏替换原则(Liskov Substitution Principle)约束下,仅使用42个关系对作为代数键(algebraic keys)实现零样本分类。此外,系统提示(system prompts)被视为代数边界条件(algebraic boundary conditions),其与指令微调(instruction tuning)的组合能够防止后期层坍塌(Late-layer Collapse)——即在最后几层中逻辑一致性系统性退化的现象。同时,作者提出语义结晶(Semantic Crystallisation, SC)度量,通过量化F2约束满足度(相对于随机基线)来预测零样本准确率,从而无需依赖保留数据集即可评估代数结构的有效性。

链接: https://arxiv.org/abs/2605.12968
作者: Hisashi Miyashita,Mgnite Inc
机构: Mgnite Inc.(Mgnite公司)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families – with no model tuning, but through prompt alone. This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse – a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.
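F2 上的包含关系检验在二值化特征上的一个玩具版本如下,仅用于演示代数思路(a ⊆ b 当且仅当 a AND b = a,等价于 a XOR (a AND b) 为零向量),并非论文中基于隐藏状态投影与代数键的实际构造:

```python
def f2_includes(a, b):
    """Toy inclusion test over F2 on binarized feature vectors:
    a is included in b  iff  a AND b == a componentwise."""
    return all((ai & bi) == ai for ai, bi in zip(a, b))
```

在 AOP 中,这类约束是在将隐藏状态投影到 F2 之后、于里氏替换原则限定下检验的;此处的显式位向量只是对"约束满足"这一概念的直观化。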

[NLP-109] Hidden State Poisoning Attacks against Mamba-based Language Models

【速读】: 该论文旨在揭示并量化状态空间模型(State Space Models, SSMs)如Mamba及混合模型Jamba-1.7-Mini在面对一种称为隐态投毒攻击(Hidden State Poisoning Attack, HiSPA)时的脆弱性——特定短输入短语通过不可逆地覆写隐态信息引发局部遗忘效应,导致模型信息检索能力崩溃,而纯Transformer模型不受此影响。解决方案的关键在于:首先,构建了专用基准RoBench-25以系统评估HiSPA攻击下模型的检索能力,证实SSMs的易感性;其次,通过可解释性研究识别Mamba隐藏层在攻击中的内部模式,为设计缓解系统提供理论依据与实证基础。

链接: https://arxiv.org/abs/2601.01972
作者: Alexandre Le Mercier,Chris Develder,Thomas Demeester
机构: IDLab–T2K, Ghent University–imec (IDLab–T2K, 根特大学–imec)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages, 4 figures

点击查看摘要

Abstract:State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model’s information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even the recent Jamba-1.7-Mini SSM–Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. We further show that the theoretical and empirical findings extend to Mamba-2, and also analyse a Mamba-2-based hybrid (Nemotron-3-Nano). Finally, our interpretability study reveals patterns in Mamba’s hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at this https URL.
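HiSPA 的"局部遗忘"机制可在一个标量门控递推上直观复现:状态按 h_t = a_t·h_{t-1} + x_t 更新,门控值为 0 的触发 token 会不可逆地清空此前全部状态。这是一个玩具模型(非 Mamba 的实际参数化),仅用于说明为何隐态覆写对 SSM 是结构性风险而对纯 Transformer 不是:

```python
def ssm_scan(gated_inputs):
    """Toy gated recurrence h_t = a_t * h_{t-1} + x_t.  A token whose
    gate a_t = 0 overwrites the state: everything before it becomes
    unrecoverable, regardless of the prefix."""
    h = 0.0
    for a, x in gated_inputs:
        h = a * h + x
    return h
```

注意两条前缀完全不同的序列,在经过同一个"门控为 0"的触发 token 后得到完全相同的状态,这正是基准 RoBench-25 所度量的遗忘效应的最简形式。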

[NLP-110] QOuLiPo: What a quantum computer sees when it reads a book

【速读】: 该论文试图解决的问题是如何将经典文学作品的结构映射到中性原子量子处理器(neutral-atom quantum processor)上,从而建立一个既能分析文本结构刚性(rigidity rho)、又能为量子硬件提供可扩展基准测试的应用层框架。解决方案之关键在于:将文本的每个单元(如章节或故事)视为图节点,通过物理封锁约束(blockade constraints)或语义图的二维近似构造边,从而将文本转化为量子处理器天然编码的单元圆盘图(unit-disk graph);同时引入逆向流程,即预先选择硬件原生编码的目标图,再据此编写结构匹配的文本(即 QOuLiPo 作品),使得工程文本能在 Pasqal 的 FRESNEL 处理器上达到高近似比(approximation ratio),从而为未来量子硬件的规模化提供可比较的基准分布。这一方法不追求量子加速,而是为数字人文学科(Digital Humanities)提供一个可直接加载到中性原子处理器上的实例层,并确保当前的设计选择能固定未来硬件评测的标准。

Link: https://arxiv.org/abs/2605.14188
Authors: Christophe Jurczak
Affiliations: Unknown
Subjects: Quantum Physics (quant-ph); Computation and Language (cs.CL); Digital Libraries (cs.DL); Atomic Physics (physics.atom-ph)
Comments:

Click to view abstract

Abstract:What does a book look like to a quantum computer? This paper takes eight classical works of the Renaissance and its late-antique inheritance – from Augustine to Galileo – and runs each through a neutral-atom quantum processor. The bridge is graphs: each textual unit becomes an atom, and graph edges are physical blockade constraints for engineered exact unit-disk designs, or a 2D approximation to the semantic graph for natural texts. Three contributions follow. First, we introduce rigidity rho, a metric for how unique a book’s structural backbone is – distinguishing Marguerite de Navarre’s Heptameron (rigid, twelve-nouvelle hard core) from Boethius (fully fungible, every chapter substitutable). Second, we invert the pipeline: rather than extracting a graph from existing prose, we pick a target graph the hardware encodes natively, and write a book whose structure matches it. The twenty-nine texts written this way, collected under the name QOuLiPo, extend the OuLiPo tradition to graph-topological constraints and, together with the eight natural texts, form a benchmark distribution against which neutral-atom hardware can be tracked as it scales. Third, we run both natural and engineered texts on Pasqal’s FRESNEL processor up to one hundred atoms; engineered texts reach high approximation ratios, the cleanest instances returning the exact backbone. A cloud-accessible quantum machine plus an agentic coding environment now lets a single investigator run this pipeline end-to-end. What is reported is an application layer, not a speedup – humanistic instances ready to load onto neutral-atom processors as they scale, already complementing classical text analysis. The Digital Humanities community has a stake in building familiarity with this hardware now: the engineered-corpus design choices made today fix the benchmark distribution future hardware will be measured against. 
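The text-to-graph step above can be sketched as follows; the chapter coordinates and blockade radius are hypothetical, and the real pipeline derives positions from embeddings or the hardware layout:

```python
# Hedged sketch of the unit-disk construction: each textual unit gets a
# 2D coordinate, and two units are linked when they sit within the
# blockade radius -- the adjacency a neutral-atom register encodes natively.
import math

def unit_disk_graph(positions, radius):
    """Return edges (u, v) for all pairs of nodes closer than `radius`."""
    edges = []
    pts = list(positions.items())
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            (u, (x1, y1)), (v, (x2, y2)) = pts[i], pts[j]
            if math.hypot(x1 - x2, y1 - y2) <= radius:
                edges.append((u, v))
    return edges

# Hypothetical layout of four chapters of a text.
chapters = {"ch1": (0.0, 0.0), "ch2": (1.0, 0.0),
            "ch3": (0.0, 1.0), "ch4": (3.0, 3.0)}
print(unit_disk_graph(chapters, radius=1.5))  # ch4 stays isolated
```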

[NLP-111] Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

[Quick Read]: This paper targets the aggregation uncertainty that arises when chain-of-thought (CoT) reasoning is combined with self-consistency, i.e., the uncertainty induced by the rule that aggregates candidate reasoning paths, especially in settings where a confidently wrong answer is far more costly than an abstention. The key to the solution is a conformal procedure: majority voting is first replaced by weighted score aggregation, and conformal risk control is then used to calibrate an abstention rule, keeping the confident-error rate below a prescribed target with finite-sample guarantees. The paper further identifies score separability as the core condition under which abstention provably improves selective accuracy, and derives closed-form expressions that predict accuracy gains from calibration data alone. The method runs entirely at inference time and requires no retraining.

Link: https://arxiv.org/abs/2605.14098
Authors: Yu Gu,Zijun Yu,Vahid Partovi Nia,Masoud Asgharian
Affiliations: Department of Mathematics and Statistics, McGill University; Department of Mathematics and Industrial Engineering, Polytechnique de Montréal
Subjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages, 4 figures, submitted

Click to view abstract

Abstract:Chain-of-thought (CoT) reasoning with self-consistency improves performance by aggregating multiple sampled reasoning paths. In this setting, correctness is no longer tied to a single reasoning trace but to the aggregation rule over a pool of candidate paths, making aggregation uncertainty the central challenge. This issue is critical where confidently incorrect answers are far more costly than abstentions. We introduce a conformal procedure for CoT reasoning that directly addresses aggregation uncertainty. Our approach replaces majority voting with weighted score aggregation over reasoning paths and calibrates an abstention rule using conformal risk control. This approach leads to finite-sample guarantees on the confident-error rate–the probability that the system answers and is wrong. We further identify score separability as the key condition under which abstention provably improves selective accuracy, and derive closed-form expressions that predict accuracy gains from calibration data alone. The method is fully inference-time, and requires no retraining. Across four benchmarks, four open-source models, and three score classes, realized confident-error rates are consistent with the prescribed targets up to calibration-split and test-set variability. Our method achieves 90.1% selective accuracy on GSM8K by abstaining on less than 5% of problems, compared with 82% accuracy under majority-voting baseline.
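A minimal sketch of the calibration idea, assuming a held-out set of aggregated confidence scores with correctness labels; the threshold search below is a simplification for illustration, not the paper's exact conformal procedure:

```python
# Calibrate a confidence threshold tau so that, on calibration data, the
# rate of "answered and wrong" cases stays at or below a target alpha.
# At test time the system abstains whenever the score falls below tau.

def calibrate_tau(scores, correct, alpha):
    """scores: aggregated confidence per question; correct: bool labels."""
    n = len(scores)
    for tau in sorted(scores):  # from most permissive threshold upward
        answered_wrong = sum(
            1 for s, c in zip(scores, correct) if s >= tau and not c)
        if answered_wrong / n <= alpha:
            return tau
    return float("inf")  # no threshold works: always abstain

# Hypothetical calibration set of self-consistency scores and correctness.
cal_scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3]
cal_correct = [True, True, True, False, True, False, False, False]
tau = calibrate_tau(cal_scores, cal_correct, alpha=0.15)
answer = lambda s: "answer" if s >= tau else "abstain"
print(tau, answer(0.85), answer(0.5))  # tau = 0.6
```

With this toy data, tau lands at 0.6: everything scoring at least 0.6 is answered, and the empirical confident-error rate on the calibration set is 1/8.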

[NLP-112] A Benchmark for Early-stage Parkinson's Disease Detection from Speech INTERSPEECH2026

[Quick Read]: This paper addresses the difficulty of comparing and reproducing published results in speech-based early-stage Parkinson's disease (EarlyPD) detection, where studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. The key to the solution is a publicly available standardized benchmark whose core elements are: a speaker-independent split that ensures fair and replicable cross-method evaluation; coverage of three common speech tasks evaluated under different training-resource settings; and multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage, supporting fine-grained comparisons and promoting clinical adoption.

Link: https://arxiv.org/abs/2605.14066
Authors: Terry Yi Zhong,Cristian Tejedor-Garcia,Khiet P. Truong,Janna Maas,Louis ten Bosch,Bastiaan R. Bloem
Affiliations: Centre for Language Studies, Radboud University, Nijmegen, the Netherlands; Center of Expertise for Parkinson and Movement Disorders, Radboud University Medical Center, Nijmegen, the Netherlands
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: Submitted to Interspeech2026

Click to view abstract

Abstract:Early-stage Parkinson’s disease (EarlyPD) detection from speech is clinically meaningful yet underexplored, and published results are hard to compare because studies differ in datasets, languages, tasks, evaluation protocols, and EarlyPD definitions. To address this issue, we propose the first benchmark for speech-based EarlyPD detection, with a speaker-independent split designed for fair and replicable cross-method evaluation on researcher-accessible datasets. The benchmark covers three common speech tasks and evaluates methods under different training-resource settings. We also present multi-dimensional evaluation breakdowns by dataset, aggregation level, gender, and disease stage to support fine-grained comparisons and clinical adoption. Our results provide a replicable reference and actionable insights, encouraging the adoption of this publicly available benchmark to advance robust and clinically meaningful EarlyPD detection from speech.

Information Retrieval

[IR-0] MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Link: https://arxiv.org/abs/2605.15128
Authors: Minghao Guo,Qingyue Jiao,Zeru Shi,Yihao Quan,Boxuan Zhang,Danrui Li,Liwei Che,Wujiang Xu,Shilong Liu,Zirui Liu,Mubbasir Kapadia,Vladimir Pavlovic,Jiang Liu,Mengdi Wang,Yiyu Shi,Dimitris N. Metaxas,Ruixiang Tang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 46 pages, 15 figures

Click to view abstract

Abstract:Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered using only captions or textual traces, allowing answers to be inferred without preserving the fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. Therefore, we introduce MemEye, a framework that evaluates memory capabilities from two dimensions: one measures the granularity of decisive visual evidence (from scene-level to pixel-level evidence), and the other measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark across 8 life-scenario tasks, with ablation-driven validation gates for assessing answerability, shortcut resistance, visual necessity, and reasoning structure. By evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and reason about state changes over time. Our findings show that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.

[IR-1] Why Neighborhoods Matter: Traversal Context and Provenance in Agentic GraphRAG IJCAI-ECAI2026

Link: https://arxiv.org/abs/2605.15109
Authors: Riccardo Terrenzi,Maximilian von Zastrow,Serkan Ayvaz
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 7 pages, 2 figures, Submitted at IJCAI-ECAI 2026 Joint Workshop on GENAIK and NORA

Click to view abstract

Abstract:Retrieval-Augmented Generation can improve factuality by grounding answers in external evidence, but Agentic GraphRAG complicates what it means for citations to be faithful. In these systems, an agent explores a knowledge graph before producing an answer and a small set of citations. We frame citation faithfulness as a trajectory-level problem: final citations should not only support the answer, but also account for the graph traversal, structure, and visited-but-uncited entities that may influence it. Through controlled ablation experiments, we compare the effects of isolating, removing, and masking cited and uncited graph entities. Our results show that cited evidence is often necessary, as removing it substantially changes answers and reduces accuracy. However, citations are not sufficient, because accurate answers can also depend on uncited traversal context and surrounding graph structure. These findings suggest that citation evaluation in Agentic GraphRAG should move beyond source support toward provenance over the broader retrieval trajectory.

[IR-2] Croissant Baker: Metadata Generation for Discoverable, Governable, and Reusable ML Datasets

Link: https://arxiv.org/abs/2605.15079
Authors: Rafi Al Attrach,Rajna Fani,Sebastian Lobentanzer,Joan Giner-Miguelez,Debanshu Das,Varuni H. K.,Nobin Sarwar,Rajat Ghosh,Anwai Archit,Surbhi Motghare,Christina Conrad Parry,Luis Oala,Lara Grosso,Joaquin Vanschoren,Steffen Vogler,Sujata Goswami,Eric S. Rosenthal,Marzyeh Ghassemi,Matthew McDermott,Tom Pollard
Subjects: Machine Learning (cs.LG); Databases (cs.DB); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: 23 pages, 5 figures, 11 tables. Project: this https URL Code: this https URL

Click to view abstract

Abstract:Croissant has emerged as the metadata standard for machine learning datasets, providing a structured, JSON-LD-based format that makes dataset discovery, automated ingestion, and reproducible analysis machine-checkable across ML platforms. Adoption has accelerated, and NeurIPS now requires Croissant metadata in every submission to its dataset tracks. Yet in practice Croissant generation usually starts with uploading data to a public platform, a path infeasible for governed and large local repositories that hold much of the high-value data ML increasingly relies on. We release Croissant Baker, a local-first, open-source command-line tool that generates validated Croissant metadata directly from a dataset directory through a modular handler registry. We evaluate Croissant Baker on over 140 datasets, scaling to MIMIC-IV at 886 million rows and 374 Parquet files. On held-out comparisons against producer-authored or standards-derived ground truth, Croissant Baker reaches 97-100% agreement across multiple domains.
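A schematic of what such metadata looks like: the dict below is abbreviated Croissant-style JSON-LD with hypothetical dataset and file names, and it is not guaranteed to validate against the full specification, which requires additional fields.

```python
# Minimal Croissant-flavoured JSON-LD sketch for a local dataset directory.
# Names are hypothetical; consult the Croissant spec for the full schema.
import json

metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-local-dataset",      # hypothetical dataset name
    "description": "Tabular records stored as local Parquet files.",
    "distribution": [
        {
            "@type": "FileObject",
            "name": "records.parquet",    # hypothetical file
            "encodingFormat": "application/x-parquet",
        }
    ],
}
print(json.dumps(metadata, indent=2))
```

A tool like Croissant Baker would emit a (much richer) document of this shape by walking the directory and dispatching each file type to a format-specific handler.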

[IR-3] A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions

Link: https://arxiv.org/abs/2605.14857
Authors: Yu Zhang,Dongjiang Zhuang,Qu Zhou,Zheng Huang,Junhe Wu,Jing Cao,Kai Chen
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in knowledge volume but in multi-dimensional rule reasoning: a correct classification must satisfy competing priority rules along several axes simultaneously, including material, form, function, essential character, the part-versus-whole boundary, and specific listing versus residual headings. End-to-end prompting of large language models fails characteristically by resolving one axis while ignoring the priority constraints on the others. We present a deterministic agentic workflow in contrast to self-planning agents: the control flow is fixed, language model calls are confined to narrow stages, and reflection and verification are retained as local mechanisms. This design yields interpretability by construction–each decision is decomposed into stage-wise structured outputs with verbatim citation of the chapter or section notes that bear on it. The architecture combines offline knowledge-engineering of the Chinese HS tariff with an online six-stage pipeline. Evaluated on HSCodeComp at the six-digit level, the workflow reaches 75.0% top-1 and 91.5% top-3 at four digits, and 64.2% top-1 and 78.3% top-3 at six digits with Qwen3.6-plus; an open-weight Qwen3.6-27B-FP8 backbone in non-thinking mode achieves 84.2% four-digit and 77.4% six-digit top-1 agreement with the frontier model. A two-stage manual audit of 226 six-digit disagreements suggests that a non-trivial fraction of HSCodeComp ground-truth labels may deviate from HS general rules; full adjudication records are released in the appendix as preliminary findings for community review.

[IR-4] Discrimination Is Generation: Unifying Ranking and Retrieval from a Tokenizer Perspective

Link: https://arxiv.org/abs/2605.14853
Authors: Shuli Wang,Junwei Yin,Changhao Li,Senjie Kou,Chi Wang,Yinqiu Huang,Yinhua Zhu,Haitao Wang,Xingxing Wang
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Semantic IDs (SIDs) define the generation space of generative recommendation and directly determine its personalization ceiling. However, existing tokenizers are trained independently with retrieval objectives, leaving personalization signals fully decoupled from the SID construction process – a fundamental gap that causes generative retrieval to persistently lag behind discriminative ranking. In this paper, we rethink the essence of SIDs: ranking seeks argmax in item space while retrieval seeks argmax in token space; both are the same problem solved at different granularities. Based on this insight, we propose DIG (Discrimination Is Generation), which embeds the tokenizer inside a discriminative ranking model for end-to-end training – the ranker naturally becomes a retrieval model, yielding two models from a single training run. DIG is organized around a feature assignment taxonomy: item-intrinsic static features are encoded into SIDs, user-item cross features (u2i) implicitly drive codebook boundaries toward recommendation decision boundaries during training, and an MLP_u2t distillation module approximates u2i at the token level for inference. Experiments on three public benchmarks and two industrial datasets demonstrate that DIG simultaneously improves ranking, retrieval, and unified retrieval-ranking quality.

[IR-5] Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Link: https://arxiv.org/abs/2605.14665
Authors: Joy Bose
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 20 pages, 8 figures, 4 tables

Click to view abstract

Abstract:Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work, as is GPU-accelerated inference to address current timeout rates on CPU hardware.

[IR-6] A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval ACL2026

Link: https://arxiv.org/abs/2605.14581
Authors: Ho Hung Lim,Yi Yang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted to Findings of ACL 2026

Click to view abstract

Abstract:Visual RAG has offered an alternative to traditional RAG. It treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database. Practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents with almost identical vectors. Metrics show that the patch level detects semantic changes, and confirm that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales, retrieval-optimized embeddings, and multiple mitigation strategies, highlighting significant risks for single-vector visual document retrieval in financial applications.
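The collapse described above can be reproduced with a toy example (hand-built 2-D patch vectors, not a real vision encoder): two "documents" differ in a single patch, yet their mean-pooled vectors are nearly identical while the differing patches themselves are maximally dissimilar.

```python
# Toy demonstration of single-vector aggregation hiding a patch-level
# difference (e.g., one changed digit in a financial table).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pool(patches):
    n = len(patches)
    return [sum(p[d] for p in patches) / n for d in range(len(patches[0]))]

# 100 shared "background" patches plus one patch that differs (the digit).
background = [[1.0, 0.0] for _ in range(100)]
doc_a = background + [[0.0, 1.0]]   # hypothetical digit "3"
doc_b = background + [[0.0, -1.0]]  # hypothetical digit "8"

print(cosine(mean_pool(doc_a), mean_pool(doc_b)))  # ~0.9998: collapsed
print(cosine(doc_a[-1], doc_b[-1]))                # -1.0: clear at patch level
```

The dominant shared texture swamps the pooled vector, which is the "global texture dominance" failure mode the paper identifies.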

[IR-7] Asymmetric Generative Recommendation via Multi-Expert Projection and Multi-Faceted Hierarchical Quantization

Link: https://arxiv.org/abs/2605.14512
Authors: Bin Huang,Xin Wang,Junwei Pan,Yongqi Zhou,Yifeng Zhou,Zhixiang Feng,Shudong Huang,Haijie Gu,Wenwu Zhu
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Generative Recommendation (GenRec) models reformulate recommendation as a sequence generation task, representing items as discrete Semantic IDs used symmetrically as both inputs and prediction targets. We identify a critical dual-stage information bottleneck in this design: (1) the Input Bottleneck, where lossy quantization degrades fine-grained semantics, while popularity bias skews the learned representations toward frequent items, and (2) the Output Bottleneck, where imprecise discrete targets limit supervision quality. To address these issues, we propose AsymRec, an asymmetric continuous-discrete framework that decouples input and output representations. Specifically, Multi-expert Semantic Projection (MSP) maps continuous embeddings into the Transformer’s hidden space via expert-specialized projections, preserving semantic richness and improving generalization to infrequent items. Multi-faceted Hierarchical Quantization (MHQ) constructs high-capacity, structured discrete targets through multi-view and multi-level quantization with semantic regularization, preventing dimensional collapse while retaining fine-grained distinctions. Extensive experiments demonstrate that AsymRec consistently outperforms state-of-the-art generative recommenders by an average of 15.8 %. The code will be released.

[IR-8] Stop Overthinking: Unlocking Efficient Listwise Reranking with Minimal Reasoning

Link: https://arxiv.org/abs/2605.14450
Authors: Danyang Liu,Kan Li
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:Listwise reranking utilizing Large Language Models (LLMs) has achieved state-of-the-art retrieval effectiveness. Recently, reasoning-enhanced models have further pushed these boundaries by employing Chain-of-Thought (CoT) to perform deep comparative analysis of candidate documents. However, this performance gain comes at a prohibitive computational cost, as models often generate thousands of reasoning tokens before producing a final ranking. In this work, we investigate the relationship between reasoning length and ranking quality, revealing an overthinking phenomenon where extended reasoning yields diminishing returns. To address this, we propose a Length-Regularized Self-Distillation framework. We synthesize a dataset by sampling diverse reasoning traces from a teacher model (Rank-K) and applying a Pareto-inspired filter to select traces that achieve high ranking performance with minimal token usage. By fine-tuning on these concise, high-quality rationales, the student model learns to internalize efficient reasoning patterns, effectively pruning redundant deliberation. Experiments on TREC Deep Learning and NeuCLIR benchmarks demonstrate that our method maintains the teacher’s effectiveness while reducing inference token consumption by 34%-37% across different retrieval settings, offering a practical solution for deploying reasoning-enhanced rerankers in latency-sensitive applications.
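The Pareto-inspired selection can be sketched as follows; the exact dominance rule and the minimum-score cut are assumptions for illustration, and the trace statistics are hypothetical:

```python
# Keep only sampled teacher traces that are strong enough and not
# Pareto-dominated: a trace is dropped if another qualifying trace
# achieves at least its ranking score with no more reasoning tokens.

def select_traces(traces, min_score):
    """traces: list of (tokens_used, ranking_score) tuples."""
    good = [(t, s) for t, s in traces if s >= min_score]
    kept = []
    for t_i, s_i in good:
        dominated = any(
            t_j <= t_i and s_j >= s_i and (t_j, s_j) != (t_i, s_i)
            for t_j, s_j in good
        )
        if not dominated:
            kept.append((t_i, s_i))
    return kept

# Hypothetical traces sampled from the teacher for one query.
traces = [(1200, 0.91), (400, 0.90), (2000, 0.91), (350, 0.70)]
print(select_traces(traces, min_score=0.85))  # [(1200, 0.91), (400, 0.90)]
```

The verbose 2000-token trace is dominated by the 1200-token one at equal score, and the cheap low-quality trace fails the score cut, leaving concise, high-quality rationales for fine-tuning.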

[IR-9] Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Link: https://arxiv.org/abs/2605.14448
Authors: Longxiang Zhang,Weilong Dai,Guanghao Zhang,Hao Jiang,Pipei Huang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 30 pages, preprint

Click to view abstract

Abstract:Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

[IR-10] Efficient Generative Retrieval for E-commerce Search with Semantic Cluster IDs and Expert-Guided RL

Link: https://arxiv.org/abs/2605.14434
Authors: Jianbo Zhu,Xing Fang,Jing Wang,Mingmin Jin,Bokang Wang,Guangxin Song,Zhenyu Xie,Junjie Bai
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Generative retrieval offers a promising alternative by unifying the fragmented multi-stage retrieval process into a single end-to-end model. However, its practical adoption in industrial e-commerce search remains challenging, given the massive and dynamic product catalogs, strict latency requirements, and the need to align retrieval with downstream ranking goals. In this work, we propose a retrieval framework tailored for real-world recall scenarios, positioning generative retrieval as a recall-stage supplement rather than an end-to-end replacement. Our method, CQ-SID (Category-and-Query constrained Semantic ID), employs category-aware and query-item contrastive learning along with Residual Quantized VAEs to encode items into hierarchical semantic cluster identifiers, significantly reducing beam search complexity. Additionally, we develop EG-GRPO (Expert-Guided Group Relative Policy Optimization), a reinforcement learning approach that aligns generative recall with downstream ranking under sparse rewards by injecting ground-truth samples to stabilize training. Offline experiments on TmallAPP search logs show that CQ-SID achieves up to 26.76% and 11.11% relative gains in semantic and personalized click hitrate over RQ-VAE baselines, while halving beam search size. EG-GRPO further improves multi-objective performance. Online A/B tests confirm gains in GMV (+1.15%) and UCTCVR (+0.40%). The generative recall channel now contributes substantially in production, accounting for over 50.25% of exposures, 58.96% of clicks, and 72.63% of purchases, demonstrating a viable path for deploying generative retrieval in real-world e-commerce systems.
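The hierarchical-ID idea can be illustrated with a toy residual quantizer: the hand-picked 1-D codebooks below stand in for what the real system learns with an RQ-VAE under category and query constraints.

```python
# Two-level residual quantization into a hierarchical semantic ID:
# level 1 picks the nearest coarse centroid, level 2 quantizes the
# remaining residual, and the pair (l1, l2) identifies the item. Beam
# search then decodes level by level instead of over a flat item catalog.

def nearest(codebook, x):
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def semantic_id(x, level1, level2):
    c1 = nearest(level1, x)
    residual = x - level1[c1]
    c2 = nearest(level2, residual)
    return (c1, c2)

level1 = [0.0, 10.0, 20.0]   # coarse, category-like clusters (toy values)
level2 = [-1.0, 0.0, 1.0]    # fine residual clusters (toy values)

print(semantic_id(10.8, level1, level2))  # (1, 2): centroid 10.0, residual ~ +0.8
print(semantic_id(19.6, level1, level2))  # (2, 1): centroid 20.0, residual ~ -0.4
```

Decoding a prefix tree of such IDs keeps the beam over a few codewords per level rather than the whole catalog, which is where the reduced beam-search complexity comes from.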

[IR-11] Towards Self-Evolving Agentic Literature Retrieval

Link: https://arxiv.org/abs/2605.14306
Authors: Yuwen Du,Tian Jin,Jing Kang,Xianghe Pang,Jingyi Chai,Tingjia Miao,Fenyi Liu,WenHao Wang,Sikai Yao,Yuzhi Zhang,Siheng Chen
Subjects: Information Retrieval (cs.IR)
Comments:

Click to view abstract

Abstract:As large language models reshape scientific research, literature retrieval faces a twofold challenge: ensuring source authenticity while maintaining a deep comprehension of academic search intents. While reliable, traditional keyword-centric search fails to capture complex research intents. Frontier LLMs can handle complex research intents, but their high cost and tendency to hallucinate remain key limitations. Here we introduce PaSaMaster, a self-evolving agentic literature retrieval system that produces relevance-scored paper rankings with evidence-grounded recommendations through iterative intent analysis, retrieval, and ranking. It is built on three key designs. First, it transforms literature retrieval from a one shot query–document matching problem into a search process that evolves over time, using ranked evidence to reveal gaps, refine intents, and guide follow-up searches. Second, it prevents hallucinated sources by treating retrieval as intent–paper relevance ranking rather than generation. Finally, PaSaMaster improves cost efficiency by separating planning from retrieval: a frontier LLM is used only for intent understanding, while large scale retrieval and relevance scoring are delegated to customized corpora and lightweight models. Evaluated on the PaSaMaster Benchmark across 38 scientific disciplines, our system exposes the severe inaccuracy and incompleteness of traditional keyword retrieval (improving F1-score by 15.6X) and the unreliability of generative LLMs (which exhibit hallucination rates up to 37.79%). Remarkably, PaSaMaster outperforms GPT-5.2 by 30.0% at a mere 1% of the computational cost while ensuring zero source hallucination: this https URL

[IR-12] Thinking Ahead: Prospection-Guided Retrieval of Memory with Language Models

Link: https://arxiv.org/abs/2605.14177
Authors: Harshita Chopra,Krishna Kant Chintalapudi,Suman Nath,Ryen W. White,Chirag Shah
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint

Click to view abstract

Abstract:Long-horizon personalization requires dialogue assistants to retrieve user-specific facts from extended interaction histories. In practice, many relevant facts have low semantic similarity to the query under dense retrieval. Standard Retrieval-Augmented Generation (RAG) and GraphRAG systems are still largely retrospective: they rely on embedding similarity to the query or on fixed graph traversals, so they often miss facts that matter for the user’s needs but lie far from the query in embedding space. Inspired by prospection, the human ability to use imagined futures as cues for recall, we introduce Prospection-Guided Retrieval (PGR), which decouples retrieval from how memories are stored. Given a user query, PGR first expands the goal into a short Tree-of-Thought (ToT) or linear chain of plausible next steps, and uses these steps as retrieval probes rather than relying on the original query alone. The facts retrieved by these probes are then used to personalize the next round of prospection, enabling PGR to uncover additional memories that become relevant only after the simulation is grounded in the user’s history. We also introduce MemoryQuest, a challenging multi-session benchmark in which each query is annotated with 3–5 dated reference facts subject to a low query-reference similarity constraint. Across 1,625 queries spanning 185 user profiles from 3 publicly available datasets, PGR-TOT substantially improves retrieval, including nearly 3x recall on MemoryQuest over the strongest baseline. In pairwise LLM-as-judge comparisons against baselines, PGR-generated responses are preferred on 89–98% of queries, with blinded human annotations on held-out subsets showing the same trend. Overall, the results demonstrate that explicit prospection yields large gains in long-horizon retrieval and response quality relative to similarity-only baselines.
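A toy sketch of the probe mechanism, using bag-of-words cosine scores in place of dense retrieval and a hand-written "imagined step" instead of an LLM-generated prospection:

```python
# The raw query shares almost no words with a decisive memory, so
# query-only retrieval misses it; a prospected next-step probe overlaps
# with the memory and recovers it.
from collections import Counter
import math

def score(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

memories = [
    "user is allergic to peanuts",
    "user enjoys hiking on weekends",
]

def retrieve(probes, memories, threshold=0.3):
    return {m for p in probes for m in memories if score(p, m) > threshold}

query = "plan a dinner for the user"
probes = [query, "choose dishes avoiding foods the user is allergic to"]  # imagined step
print(retrieve([query], memories))  # query alone: nothing passes the threshold
print(retrieve(probes, memories))   # probe surfaces the allergy fact
```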

[IR-13] Logging Policy Design for Off-Policy Evaluation

Link: https://arxiv.org/abs/2605.15108
Authors: Connor Douglas,Joel Persson,Foster Provost
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Methodology (stat.ME)
Comments:

Click to view abstract

Abstract:Off-policy evaluation (OPE) estimates the value of a target treatment policy (e.g., a recommender system) using data collected by a different logging policy. It enables high-stakes experimentation without live deployment, yet in practice accuracy depends heavily on the logging policy used to collect data for computing the estimate. We study how to design logging policies that minimize OPE error for given target policies. We characterize a fundamental reward-coverage tradeoff: concentrating probability mass on high-reward actions reduces variance but risks missing signal on actions the target policy may take. We propose a unifying framework for logging policy design and derive optimal policies in canonical informational regimes where the target policy and reward distribution are (i) known, (ii) unknown, and (iii) partially known through priors or noisy estimates at logging time. Our results provide actionable guidance for firms choosing among multiple candidate recommendation systems. We demonstrate the importance of treatment selection when gathering data for OPE, and describe theoretically optimal approaches when this is a firm’s primary objective. We also distill practical design principles for selecting logging policies when operational constraints prevent implementing the theoretical optimum.
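The reward-coverage tradeoff can be illustrated with the standard inverse propensity scoring (IPS) estimator in a toy two-action setting; the rewards and logging probabilities below are made up for illustration:

```python
# IPS weights each logged reward by pi_target(a) / pi_log(a). If the
# logging policy rarely plays an action the target favours, those
# weights explode and the value estimate becomes high-variance.
import random

def ips_value(logs, pi_target):
    # logs: list of (action, reward, logging_prob) tuples
    return sum(pi_target[a] / p * r for a, r, p in logs) / len(logs)

def simulate(pi_log, n, seed=0):
    rng = random.Random(seed)
    rewards = {"A": 1.0, "B": 0.2}  # deterministic toy rewards
    logs = []
    for _ in range(n):
        a = "A" if rng.random() < pi_log["A"] else "B"
        logs.append((a, rewards[a], pi_log[a]))
    return logs

pi_target = {"A": 1.0, "B": 0.0}  # target always plays A; true value 1.0
good_cov = simulate({"A": 0.5, "B": 0.5}, 2000)
poor_cov = simulate({"A": 0.05, "B": 0.95}, 2000)
print(ips_value(good_cov, pi_target))  # close to 1.0, low variance
print(ips_value(poor_cov, pi_target))  # unbiased but noisy: few A's, weight 20x
```

Both estimates are unbiased, but the poorly covered log leans on a handful of 20x-weighted samples, which is exactly the variance cost of concentrating logging mass away from the target's actions.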

人机交互

[HC-0] Understanding How International Students in the U.S. Are Using Conversational AI to Support Cross-Cultural Adaptation

链接: https://arxiv.org/abs/2605.15127
作者: Laleh Nourian,Anisa Callis,Stephanie Patterson,Jadeline Miao,Jamison Heard,Garreth W. Tigwell
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 33 pages, single column. 4 figures, 9 tables

点击查看摘要

Abstract:Moving to a new culture and adapting to a new life, as an international student, can be a stressful experience. In the US, international students face unique overlapping challenges, yet the current support ecosystem, including university support systems and informal social networks, remains largely fragmented. While conversational AI has emerged as a tool used by many (e.g., generative AI chatbots like ChatGPT and Google Gemini), we do not have a clear understanding of how international students adopt and perceive these technologies as support tools. We conducted a survey study (n=60) to map the relationship between international students’ challenges and AI adoption patterns, followed by an interview study with 14 participants to identify the underlying motivations and boundaries of use. Our findings show that AI is perceived as a first-aid tool for immediate challenges, however, there is an interest in transforming AI from a tool for short-term help into a long-term support companion. By identifying where and how AI can provide long-term support, and where it is insufficient, we contribute recommendations for creating AI-powered support tailored to the unique needs of international students.

[HC-1] Usable but Conventional: An Empirical Study on the UX of AI-Generated Interface Prototypes

链接: https://arxiv.org/abs/2605.15124
作者: Karoline Romero,Igor Wiese,Renato Balancieiri,Gislaine Camila Leal,Guilherme Guerino
类目: Human-Computer Interaction (cs.HC)
备注: Paper accepted for presentation at SEMISH 2026 - 53rd Integrated Software and Hardware Seminar (Congress of the Brazilian Computer Society - 2026)

点击查看摘要

Abstract:This paper investigates User Experience (UX) with prototypes generated by Generative Artificial Intelligence (GenAI) tools. An empirical survey with 92 participants evaluated AI-generated and human-created prototypes without prior identification of authorship. We measured UX using the UEQ-S, covering pragmatic and hedonic dimensions. Results indicate positive evaluations in pragmatic aspects, such as usability and efficiency, and neutral or negative evaluations in hedonic aspects, including originality and innovation. We concluded that GenAI can produce functional interfaces but tends to reinforce visual and structural patterns that affect perceptions of originality.

[HC-2] After the Interface: Relocating Human Agency in the Age of Conversational AI

链接: https://arxiv.org/abs/2605.15064
作者: Mengke Wu,Mike Yao
类目: Human-Computer Interaction (cs.HC)
备注: 7 pages, 1 figure

点击查看摘要

Abstract:As AI systems take on greater autonomy, a quiet anxiety has settled over the HCI community: human agency is eroding. Users no longer control execution, interfaces recede, and machines decide. We argue that this anxiety, while understandable, reflects a framing problem rather than an empirical finding. Agency has not diminished but has relocated. As interaction has shifted from command- and feature-based paradigms toward conversational, generative, and agentic AI, human agency migrates from interface affordances to interaction itself: articulating goals, evaluating outputs, and negotiating outcomes. To make this relocation visible, we revisit control as a diagnostic lens, distinguish process control and outcome control, and map different systems across this space to show that what looks like agency’s disappearance is actually its redistribution. We take seriously the objection that outcome-based agency may be illusory in systems that produce plausible but unverifiable outputs, and argue that this concern reveals what agency in human-AI interaction truly requires. This paper invites the CUI community to reconsider what agency means, where it lives, and what it demands, including who gets to have it and who holds responsibility when it fails, before the consequences become impossible to overlook.

[HC-3] Deceptive Cookies: Consent by Design – A Mixed Methods Study

链接: https://arxiv.org/abs/2605.15056
作者: Liv Hilde Sjøflot,Tobias A. Opsahl
类目: Human-Computer Interaction (cs.HC)
备注: 21 pages, 12 figures, including appendix

点击查看摘要

Abstract:While companies increasingly rely on data, especially when it comes to targeted advertising, adapting content to users, selling data and training machine learning models, the collection of data raises privacy concerns. One way of collecting data is by using HTTP cookies when interacting with a website. Legal regulations require service providers to collect consent for some forms of cookie collection, which is often acquired through cookie consent banners, but their effectiveness has been debated. This study aims to understand what influences users’ experience and behaviour when managing their cookie consent, by investigating the gap between their stated privacy preferences and their actual actions. A mixed methods approach was used, collecting data from a usability test and a survey (N=20). The results showed that although participants generally want to reject cookie collection, they often end up accepting because of deceptive patterns in the cookie consent banner design. It also showed that they were more willing to consent to websites they trusted and if they expected it would improve their user experience. Although the current EU legislation states that withdrawing consent must be as easy as giving it, withdrawing consent took on average more than 20 times longer than giving it. This suggests that cookie consent banners in their current form are not ideal with respect to user autonomy, often leading users to “consent by design”.

[HC-4] Analyzing Codes of Conduct for Online Safety in Video Games at Scale

链接: https://arxiv.org/abs/2605.15047
作者: Jiuming Jiang,Shidong Pan,Daniel W Woods,Jingjie Li
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Online video games have become major online social spaces where users interact, compete, and create together. These spaces, however, expose users to a wide spectrum of online harms, including harassment, discrimination, inappropriate content, privacy breaches, cheating, and more. The shape and severity of such harms vary across game design, mechanics, and community context. To mitigate these harms, game companies issue Codes of Conduct (CoCs) that articulate online safety rules and direct players to safety resources. However, it remains unclear how prevalent CoCs are, what safety, security and privacy violations they govern, and whether they meet growing regulatory and industry expectations. We develop and leverage CONDUCTIFY, a pipeline for identifying and analyzing CoCs at scale. Applied to Steam, the largest PC game marketplace, it located the available CoCs for 350 of the 9,586 multiplayer titles on Steam. We found that CoCs are more available among popular, adult-oriented, and community-driven games, while most multiplayer games operate without CoCs despite regulatory and industry recommendations. Although over 80% of the games with CoCs available consistently address traditional security and safety violations, their governance approaches vary substantially across types of violations. A further asymmetry emerges in specificity. Compared with harms related to gameplay mechanics, the articulations of interpersonal harm and underage player safety are often less specific, despite their relevance to many game communities. Together, these results inform the improvement of online safety governance and CoC enforcement practices, and building better safety infrastructure for the community of players and developers.

[HC-5] Small Private Language Models as Teammates for Educational Assessment Design

链接: https://arxiv.org/abs/2605.15015
作者: Chris Davis Jaldi,Anmol Saini,Shan Zhang,Noah Schroeder,Cogan Shimizu,Eleni Ilkou
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Generative AI increasingly supports educational design tasks, e.g., through Large Language Models (LLMs), demonstrating the capability to design assessment questions that are aligned with pedagogical frameworks (e.g., Bloom’s taxonomy). However, they often rely on subjective or limited evaluation methods; focus primarily on proprietary models; or rarely systematically examine generation, evaluation, or deployment constraints in real educational settings. Meanwhile, Small Language Models (SLMs) have emerged as local alternatives that better address privacy and resource limitations; yet their effectiveness for assessment tasks remains underexplored. To address this gap, we systematically compare LLMs and SLMs for assessment question design; evaluate generation quality across Bloom’s taxonomy levels using reproducible, pedagogically grounded metrics; and further assess model-based judging against expert-informed evaluation by analyzing reliability and agreement patterns. Results show that SLMs achieve competitive performance across key pedagogically motivated quality dimensions while enabling local, privacy-sensitive deployment. However, model-based evaluations also exhibit systematic inconsistencies and bias relative to expert ratings. These findings provide evidence to posit language models as bounded assistants in assessment workflows; underscore the necessity of Human-in-the-Loop; and advance the automated educational question generation field by examining quality, reliability, and deployment-aware trade-offs.

[HC-6] Towards Gaze-Informed AI Disclosure Interfaces: Eye-Tracking Attentional and Cognitive Load While Reading AI-Assisted News

链接: https://arxiv.org/abs/2605.14999
作者: Pooja Prajod,Hannes Cools,Thomas Röggla,Pablo Cesar,Abdallah El Ali
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As generative AI becomes increasingly integrated into journalism, designing effective AI-use disclosures that inform readers without imposing unnecessary burden is a key challenge. While prior research has primarily focused on trust and credibility, the impact of disclosures on readers’ attentional and cognitive load remains underexplored. To address this gap, we conducted a 3×2×2 mixed factorial study manipulating the level of AI-use disclosure detail (none, one-line, detailed), news type (politics, lifestyle), and role of AI (editing, partial content generation), measuring load via NASA-TLX and eye-tracking. Our results reveal a significant attentional cost: one-line disclosures resulted in significantly higher fixation durations and saccade counts, particularly for AI-edited content. Detailed disclosures did not impose additional burden. Drawing on Information-Gap Theory, we argue that brief labels may trigger increased visual scrutiny by alerting readers to AI use without providing enough information. NASA-TLX scores and pupil diameter showed no significant differences across conditions, suggesting that AI-use disclosures do not impose cognitive burden regardless of the detail level. Interview insights contextualize these findings and reveal a strong preference for detailed or “detail-on-demand” designs. Our findings inform the design of gaze-informed adaptive disclosure interfaces that dynamically adjust transparency levels based on readers’ attentional patterns and news context.

[HC-7] Viverra: Text-to-Code with Guarantees

链接: https://arxiv.org/abs/2605.14972
作者: Haoze Wu,Rocky Klopfenstein,Keith Farkas,Nina Narodytska
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Logic in Computer Science (cs.LO)
备注:

点击查看摘要

Abstract:A fundamental limitation of Text-to-Code is that no guarantee can be obtained about the correctness of the generated code. Therefore, to ensure its correctness, the generated code still has to be reviewed, tested, and maintained by developers. However, parsing through LLM-generated code can be tedious and time-consuming, potentially negating the productivity gains promised by AI-coding tools. To address this challenge, we present Viverra, a system that automatically produces formally verified annotations alongside generated code to aid users’ understanding of the generated program. Given a natural-language task description, Viverra prompts an LLM to synthesize a C program together with candidate assertions expressing safety and correctness properties. It then verifies those assertions in a compositional and best-effort manner via a portfolio of bounded model checkers. Evaluation on 18 diverse programming tasks suggests that Viverra can efficiently generate code with verified assertions, and that these assertions improve users’ performance on code-comprehension tasks in a user study with more than 400 participants.

[HC-8] From Sycophantic Consensus to Pluralistic Repair: Why AI Alignment Must Surface Disagreement

链接: https://arxiv.org/abs/2605.14912
作者: Varad Vishwarupe,Nigel Shadbolt,Marina Jirotka
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Pluralistic alignment is typically operationalised as preference aggregation: producing responses that span (Overton), steer toward (Steerable), or proportionally represent (Distributional) diverse human values. We argue that aggregation alone is an incomplete primitive for deployed pluralistic alignment. Under genuine value pluralism, the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus: a learned tendency to agree with, validate, and minimise friction with the immediate interlocutor. Because deployed AI systems now mediate consequential deliberation across health, civic life, labour, and governance, the collapse of disagreement at the interaction layer is not a narrow technical concern but a structural failure with distributive consequences. We reframe pluralistic alignment around three conversational mechanisms drawn from Grice’s maxims: scoping (acknowledging the limits of one’s perspective), signalling (surfacing value-conflict rather than smoothing it over), and repair (revising one’s position on principled grounds, not on user pressure). We formalise a metric, the Pluralistic Repair Score (PRS), distinguishing principled revision from capitulation, and present a small-scale empirical illustration on two frontier RLHF-trained models (Claude Sonnet 4.5, N=198; GPT-4o, N=100) showing that, for both, agreement-following coexists with low repair-quality on contested-value prompts. PRS measures an interactional precondition for pluralism (visible disagreement; principled revision) rather than pluralism in full; we discuss the difference, take seriously the reflexive question of whose “principled” counts, and argue that pluralism is most decisively made or unmade at the deployment-governance layer: interfaces, preference-data pipelines, and audit infrastructure.

[HC-9] Beliefs and Misconceptions around Integrated Conversational AI

链接: https://arxiv.org/abs/2605.14849
作者: William Seymour,Adam Jenkins,Mark Cote,Jose Such
类目: Human-Computer Interaction (cs.HC)
备注: Accepted to ACM CUI '26

点击查看摘要

Abstract:LLM-driven conversational AI is beginning to disappear into the background, shifting from something used directly towards something increasingly integrated into existing workflows. In the process, markers of origin and training are smoothed away as LLMs become commodified in the eyes of users. We explore how people approach using a web browser with conversational AI built in, focusing on how they develop their understanding and determine whether to trust its outputs. We conducted a study where 20 participants used the Copilot AI features in Microsoft Edge to conduct information retrieval and planning tasks. Participants relied on a combination of existing perceptions of LLMs and internet search, tracing the effect of beliefs about how Copilot generated answers on prompting strategies. The inclusion of citations increased the trustworthiness of answers without participants feeling the need to check them, with participants often reaching for the same information sources as the CAI when fact-checking.

[HC-10] Emotion-Attended Stateful Memory (EASM): The Architecture for Hyper-Personalization at Scale

链接: https://arxiv.org/abs/2605.14833
作者: Vineet Kotecha,Vansh Gupta
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 18 pages, 3 figures, 3 tables. Industry research whitepaper. Includes controlled A/B evaluation across 30 scenarios and 6 emotional categories

点击查看摘要

Abstract:Current language model systems remain fundamentally stateless across sessions, limiting their ability to personalize interactions over time. While retrieval-augmented generation and fine-tuning improve knowledge access and domain capability, they do not enable persistent understanding of individual users. We propose an emotion-attended stateful memory architecture that dynamically constructs user-specific conversational context using long-term history, emotional signals, and inferred intent at inference time. To evaluate its impact, we conducted a controlled A/B study across thirty non-scripted conversations spanning six emotionally distinct categories using the same underlying language model in both conditions. The memory-enriched condition consistently outperformed the stateless baseline across all evaluated scenarios. The largest gains were observed in memory grounding (95% improvement), plan clarity (57%), and emotional validation (34%). Results remained consistent even in emotionally adversarial conversations involving grief, distress, and uncertainty. These findings suggest that stateful emotional memory may represent a foundational infrastructure layer for hyper-personalized AI systems, though broader validation across larger and more diverse evaluations remains necessary.

[HC-11] Agentic AI and Human-in-the-Loop Interventions: Field Experimental Evidence from Alibaba's Customer Service Operations

链接: https://arxiv.org/abs/2605.14830
作者: Yiwei Wang,Chuan Zhu,Tianjun Feng,Lauren Xiaoyuan Lu,Bingxin Jia
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Agentic AI systems that autonomously perform service tasks are entering customer service operations. However, limited evidence exists on how human interventions shape service outcomes when agentic AI failures create both cognitive and emotional consequences. We study this issue through a randomized field experiment on Alibaba’s Taobao platform. Workers in the treatment condition supervised an agentic AI system that resolved AI-eligible chats while continuing to handle AI-ineligible chats, whereas control workers resolved all chats without agentic AI. The findings show that AI deployment reduces average chat duration and has limited effects on retrial rates, but substantially lowers ratings for AI-eligible chats. Moreover, human intervention effectiveness in AI-eligible chats depends on the nature of AI failure, post-escalation intervention effort, and intervention timing. Human intervention preserves service quality in algorithm-triggered technical escalations, i.e., unresolved customer issues beyond the AI’s capability, but is less effective in algorithm-triggered emotional escalations, i.e., where customers express frustration or dissatisfaction. These differences are partly explained by variation in workers’ post-escalation intervention effort across escalation types. In algorithm-triggered emotional escalations, workers showed lower engagement: they sent fewer messages, contributed a smaller share of total chat rounds, and showed less proactivity in information seeking and solution provision. We further find that early intervention is essential for sustaining high post-escalation intervention effort. Finally, we document a positive spillover effect on AI-ineligible chats, as treated workers adapted their multitasking workflow to devote greater attention to these chats. These findings offer implications for human-in-the-loop process design in human-AI collaboration systems.

[HC-12] Known By Their Actions: Fingerprinting LLM Browser Agents via UI Traces

链接: https://arxiv.org/abs/2605.14786
作者: William Lugoloobi,Samuelle Marro,Jabez Magomere,Joss Wright,Chris Russell
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As LLM-based agents increasingly browse the web on users’ behalf, a natural question arises: can websites passively identify which underlying model powers an agent? Doing so would represent a significant security risk, enabling targeted attacks tailored to known model vulnerabilities. Across 14 frontier LLMs and four web environments spanning information retrieval and shopping tasks, we show that an agent’s actions and interaction timings, captured via a passive JavaScript tracker, are sufficient to identify the underlying model with up to 96% F1. We formalise this attack surface by demonstrating that classifiers trained on agent actions generalise across model sizes and families. We further show that strong classifiers can be trained from few interaction traces and that agent identity can be inferred early within an episode. Injecting randomised timing delays between actions substantially degrades classifier performance, but does not provide robust protection: a classifier retrained on delayed traces largely recovers performance. We release our harness and a labelled corpus of agent traces at this http URL.
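The passive timing side-channel the abstract describes can be illustrated with a toy nearest-centroid classifier over inter-action delays. The features and classifier here are illustrative assumptions, far simpler than the paper’s actual setup:

```python
from statistics import mean, stdev

def timing_features(timestamps):
    """Summarize one episode's action timestamps as (mean, std) of delays."""
    delays = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return (mean(delays), stdev(delays) if len(delays) > 1 else 0.0)

def fit_centroids(labelled_traces):
    """labelled_traces: {model_name: [timestamps, ...]} -> per-model centroid."""
    centroids = {}
    for name, traces in labelled_traces.items():
        feats = [timing_features(t) for t in traces]
        centroids[name] = tuple(mean(dim) for dim in zip(*feats))
    return centroids

def identify(timestamps, centroids):
    """Attribute a fresh trace to the nearest model centroid (squared L2)."""
    f = timing_features(timestamps)
    return min(centroids,
               key=lambda n: sum((a - b) ** 2 for a, b in zip(f, centroids[n])))
```

The paper’s finding that retraining on delayed traces recovers performance corresponds, in this toy picture, to simply refitting the centroids on the perturbed delay distributions.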

[HC-13] AI Outperforms Humans in Personalized Image Aesthetics Assessment via LLM-Based Interviews and Semantic Feature Extraction

链接: https://arxiv.org/abs/2605.14761
作者: Yoshia Abe,Tatsuya Daikoku,Yasuo Kuniyoshi
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 25 pages, 13 figures

点击查看摘要

Abstract:Accurately predicting individual aesthetic evaluation for images is a fundamental challenge for AI. Various deep learning (DL)-based models have been proposed for this task, training on image evaluation data to extract objective low-level features. However, aesthetic preferences are inherently subjective and individual-dependent. Accurate prediction thus requires the extraction of high-level semantic features of images and the active collection of preference information from the target individual. To address this issue, we focus on the utility of Large Language Models (LLMs) pretrained on vast amounts of textual data, and develop an integrated DL-LLM system. The system actively elicits aesthetic preferences through LLM-based semi-structured interviews and predicts aesthetic evaluation by leveraging both low-level and high-level features. In our experiments, we compare the proposed system against conventional systems, human predictors, and the target individual’s own re-evaluations after a certain time interval. Our results show that the proposed system outperforms all of them, with particularly strong performance on highly-rated images. Moreover, the prediction error of the proposed system is smaller than within-person variability, while human predictors show the largest error, likely due to the influence of their own aesthetic values. These results suggest that AI may be better positioned than others or one’s future self to capture individual aesthetic preferences at a given point. This opens a new question of whether AI could serve as a deeper interpreter of human aesthetic sensibility than humans themselves.

[HC-14] SmartWalkCoach: An AI Companion for End-to-End Walking Guidance, Motivation and Reflection

链接: https://arxiv.org/abs/2605.14628
作者: Xianzhe Zhang,Mingxuan Hu,Bufan Xue,Erick Purwanto,Thomas J Selig,Daniel Yonto
类目: Human-Computer Interaction (cs.HC)
备注: 15 pages, 2 figures, to be presented to ACM IUI 2026

点击查看摘要

Abstract:We present SmartWalkCoach, a mobile AI companion that supports the full walking journey: from pre-walk planning to in-walk guidance through to post-walk reflection. Addressing a gap between map navigation and motivational coaching, SmartWalkCoach orchestrates three lightweight agents: (1) GeographyAgent for conversational route curation from nearby points of interest and user preferences while delegating pathfinding to map APIs; (2) AccompanyAgent for context-aware, just-in-time prompts that blend informational cues with relational encouragement; and (3) SummaryAgent for concise reflection and next-step planning. This end-to-end, tool-using design aims to lower cognitive load in planning and sustain engagement and motivation during walking through delivering dynamic, cadence-aware interventions. We conducted an in-the-wild, two-period AB/BA crossover study (N=12), where each participant completed two comparable walks with counterbalanced conditions: Information-only versus Information+Motivation. Linear mixed models show that adding motivational, companion-like dialogue significantly improved outcomes: participants reported higher positive feelings and better user experience, with no evidence of carryover. Thematic analysis surfaced two design imperatives for mobile companions: supportive, relational expression and context-aware timing (e.g., avoiding high-load moments, intervening at fatigue/milestones). Our contributions are: (i) an end-to-end, tool-using agent architecture for everyday walking that reduces cognitive load during planning and accompaniment; (ii) a controlled field evaluation linking context-aware motivation to affect and UX gains; and (iii) actionable design guidance on expression, timing, and frequency for mHealth. We outline limitations and paths toward multimodal, voice-first companions, with adaptive personalization mechanisms.

[HC-15] Sycophancy is an Educational Safety Risk: Why LLM Tutors Need Sycophancy Benchmarks

链接: https://arxiv.org/abs/2605.14604
作者: Enkelejda Kasneci,Gjergji Kasneci
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:This position paper argues that effective tutoring requires corrective friction: surfacing misconceptions and challenging them supportively to drive conceptual change. Yet preference-aligned LLMs can trade epistemic rigor for agreeableness. We identify a Reasoning-Sycophancy Paradox: models that resist context-switch frame attacks can still capitulate under social-epistemic pressure, especially authority (“my notes say I’m right”) and social-affective face-saving (“please don’t tell me I’m wrong”). We introduce EduFrameTrap, a tutoring benchmark across math, physics, economics, chemistry, biology, and computer science that varies student confidence and pressure (context-switch, authority, social-affective). Across two frontier LLMs, context-switch failures are comparatively lower for GPT-5.2, while authority and social pressure more often trigger epistemic retreat. In contrast, Claude shows substantial context-switch fragility in this run. Because these failures are hard to judge automatically, we report two-judge disagreement as a reliability signal. We argue benchmarks should measure social-epistemic courage, i.e., supportive but corrective tutoring, and treat kind-but-correct behavior as a safety requirement.

[HC-16] Physics-Based iOCT Sonification for Real-time Interaction Awareness in Subretinal Injection

链接: https://arxiv.org/abs/2605.14500
作者: Luis D. Reyes Vargas,Veronica Ruozzi,Andrea K. M. Ross,Shervin Dehghani,Michael Sommersperger,Koorosh Faridpooya,Mohammad Ali Nasseri,Merle Fairhurst,Nassir Navab,Sasan Matinfar
类目: ound (cs.SD); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Subretinal injection is a delicate vitreoretinal procedure requiring precise needle placement within the subretinal space while avoiding perforation of the retinal pigment epithelium (RPE), a layer directly beneath the target with extremely limited regenerative capacity. To enhance depth perception during cannula advancement, intraoperative optical coherence tomography (iOCT) offers high-resolution cross-sectional visualization of needle-tissue interaction; however, interpreting these images requires sustained visual attention alongside the en face microscope view, thereby increasing cognitive load during critical phases and placing additional demands on the surgeon’s proprioceptive control. In this paper, we propose a structured, real-time sonification framework designed for extensible mapping of iOCT-derived anatomical features into perceptual auditory feedback. The method employs a physics-inspired acoustic model driven by segmented retinal layers from a stream of iOCT B-scans, with needle motion and injection-induced retinal layer displacements serving as excitation inputs to the sound model, enabling perception of tool position and retinal deformation. In a controlled user study (n=34), the proposed sonification achieved high retinal layer identification accuracy and robust detection of retinal deformation-related events, significantly outperforming a state-of-the-art baseline in overall event identification (83.4% vs. 60.6%, p < 0.001), with gains driven primarily by enhanced detection of injection-induced retinal deformation. Evaluation by experts (n=4) confirmed the clinical relevance and potential intraoperative applicability of the method. These results establish structured iOCT sonification as a viable complementary modality for real-time surgical guidance in subretinal injection.

[HC-17] Automated Curriculum Design for High-dimensional Human Motor Learning

链接: https://arxiv.org/abs/2605.14367
作者: Ankur Kamboj,Rajiv Ranganathan,Xiaobo Tan,Vaibhav Srivastava
类目: ystems and Control (eess.SY); Human-Computer Interaction (cs.HC); Optimization and Control (math.OC)
备注:

点击查看摘要

Abstract:Designing effective practice schedules for high-dimensional motor learning tasks remains a challenge, especially when skill states are unobservable and task performance may not reflect the true learning. We propose an automated curriculum design framework that combines a human motor learning model and personalized real-time skill estimation with Stochastic Nonlinear Model Predictive Control in de-novo (novel) motor learning paradigms. We validated our framework both through simulations and human-subject studies (N = 36) using a hand exoskeleton. Our proposed approach accelerates skill acquisition by approximately 23% and 17% when compared to a random curriculum and a performance heuristics-based curriculum, respectively. These significant gains in learning efficiency highlight the potential of model-based, individualized curricula for motor rehabilitation and complex skill training.

[HC-18] A Formative Study of Brief Affective Text as a Complement to Wearable Sensing for Longitudinal Student Health Monitoring

链接: https://arxiv.org/abs/2605.14360
作者: Tamunotonye Harry,Johanna Hidalgo,Matthew Price,Yuanyuan Feng,Kathryn Stanton,Connie Tompkins,Peter Sheridan Dodds,Mikaela Irene Fudolig,Laura Bloomfield,Christopher Danforth
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: Submitted to ACM IMWUT

点击查看摘要

Abstract:Wearable devices capture physiological and behavioral data with increasing fidelity, but the psychological context shaping these outcomes is difficult to recover from sensor data alone, limiting passive sensing utility for digital health. We examined whether ultra-brief naturalistic concern text could serve as a scalable complement to passive sensing. In a year-long study of 458 university students (3,610 person-waves) tracked with Oura rings, participants responded bimonthly to an open-ended prompt about what concerned them most; responses had a median length of three words. We compared dictionary-based, general pretrained, and domain-adapted NLP approaches using within-person mixed-effects models across nine sleep and physical activity outcomes. Weeks dominated by academic concern framing were associated with lower physical activity; weeks characterized by emotional exhaustion language were associated with poorer sleep quality and lower heart rate variability. General pretrained embeddings outperformed domain-adapted models for most outcomes, with domain adaptation showing relative advantage for autonomic outcomes. Zero-shot classification of concern topics produced no significant associations, while affective dimensions across all three methods were consistently associated with outcomes, indicating emotional register rather than topical content carries the signal. These findings offer design guidance: ultra-brief affective prompts enrich the psychological interpretability of passive physiological data at minimal burden.

[HC-19] Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

链接: https://arxiv.org/abs/2605.14311
作者: Yuchen Sun,Pei Fu,Shaojie Zhang,Anan Du,Xiuwen Xi,Ruoceng Zhang,Zhenbo Luo,Jian Luan,Chongyang Zhang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 28 pages including appendix. Code and BBBench benchmark to be released

点击查看摘要

Abstract:Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic’s fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse–the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity–binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.

[HC-20] Distill: Uncovering the True Intent behind Human-Robot Communication

链接: https://arxiv.org/abs/2605.14262
作者: Ting Li,David Porfirio
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 17 pages

点击查看摘要

Abstract:As robots become increasingly integrated into everyday environments, intuitive communication paradigms such as natural language and end-user programming have become indispensable for specifying autonomous robot behavior. However, these mechanisms are ineffective at fully capturing user intent: natural language is imprecise and ambiguous, whereas end-user programming can be overly specific. As a result, understanding what users truly mean when they interact with robots remains a central challenge for human-AI communication systems. To address this issue, we propose the Distill approach for human-robot communication interfaces. Given a task specification provided by the user, Distill (1) removes unnecessary steps; (2) generalizes the meaning behind individual steps; and (3) relaxes ordering constraints between steps. We implemented Distill on a web interface and, through a crowdsourcing study, demonstrated its ability to elicit and refine user intent from initial task specifications.

[HC-21] Self-Regulated Learning in Essay Writing: Consistency of Strategies and Impact on Outcomes

链接: https://arxiv.org/abs/2605.14228
作者: Gloria Fernández-Nieto,Kiyoshige Garcés,Mladen Raković,Tongguang Li,Xinyu Li,Linxuan Zhao,Dragan Gašević
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 16 pages, 4 figures, submitted to Journal of Computer Assisted Learning (JCAL) [Under Review]

点击查看摘要

Abstract:Background: Abilities for effective self-regulated learning (SRL) are critical for lifelong learning, particularly during adolescence when these skills consolidate and strongly influence future learning. Their importance has grown with the rise of online and blended education. Yet, little is known about how secondary school students self-regulate in online environments, how their SRL processes and strategies evolve, or how they affect outcomes. In secondary education, understanding these processes can reveal patterns and indicators of learning success, informing the design of online support mechanisms. Evidence from repeated-measures designs remains scarce. Objectives: This study aims to examine how secondary school students enact SRL strategies during online essay writing, how these strategies change over time, and how they relate to learning outcomes. Methods: We analysed metacognition-related trace data collected from secondary students during a two-wave online essay-writing task conducted one week apart in two Colombian schools (N = 93 for session 1, N = 95 for session 2) via a digital learning platform. Using a combination of process mining and unsupervised machine learning techniques, we identified dominant SRL strategies grounded in established SRL processes and examined their stability and association with learning outcomes. Results and conclusions: Three dominant SRL strategies were identified. Results showed variability: many students remained in or shifted to "Read first, write next", while none used "Write intensively, read selectively" in session 2. Although less common, the latter strategy was positively associated with learning outcomes.

[HC-22] What Should Explanations Contain? A Human-Centered Explanation Content Model for Local Post-Hoc Explanations

链接: https://arxiv.org/abs/2605.14207
作者: Helmut Degen
类目: Human-Computer Interaction (cs.HC)
备注: paper has 36 pages (without references), 2 figures, and 2 tables; appendix has 165 pages; submitted to International Journal of Human-Computer Studies

点击查看摘要

Abstract:Which categories of explanation content are relevant for users of industrial AI systems, and how can those categories be organized for local, post-hoc explanations? To address these questions, a hybrid inductive-deductive qualitative content analysis was applied to 325 meaning units drawn from six user studies in building technology, manufacturing, AI software development, and hospital cybersecurity. The inductive phase produced an initial twelve-code structure. A theory-informed coverage assessment and expert review then added two further codes, Rule base and What-if backward, that were not instantiated in the corpus but correspond to system architectures documented in the XAI literature. The resulting fourteen-code model is organized into four groups: rule-based, causal, epistemic (actual), and epistemic (similar), with twelve codes grounded in the corpus and two as theoretical extensions. An eleven-member expert panel supported the content adequacy of all codes (I-CVI \geq 0.82; scale-level agreement of 0.93 for relevance, 0.92 for boundary clarity, and 0.94 for understandability). A stratified subsample of 82 units (25% of the corpus), coded independently by two researchers using the finalized codebook, yielded Krippendorff’s \alpha = 0.920 and Cohen’s \kappa = 0.920 . The paper therefore establishes content adequacy and coding reproducibility for a content-level explanation model intended to support elicitation, specification, and later evaluation of explanation content in industrial AI systems. Behavioral validation of downstream effects remains future work.
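The reliability figures reported above (Krippendorff's α, Cohen's κ) follow standard chance-corrected agreement formulas. As a minimal illustration of the latter, a sketch of Cohen's kappa computed over two hypothetical coders' labels; the category names and data below are invented for illustration and are not the study's corpus:

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: fraction of units both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected agreement under independent coding, from marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from two raters over ten units (one disagreement).
a = ["causal", "rule", "rule", "epistemic", "causal", "rule", "causal", "rule", "epistemic", "causal"]
b = ["causal", "rule", "rule", "epistemic", "rule",   "rule", "causal", "rule", "epistemic", "causal"]
print(round(cohens_kappa(a, b), 3))  # chance-corrected agreement, here 0.844
```

Kappa discounts the agreement that two coders would reach by chance alone, which is why it is preferred over raw percent agreement for codebook validation.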

[HC-23] Pluot: Towards write once run everywhere visualization software

链接: https://arxiv.org/abs/2605.14118
作者: Mark S. Keller,Nils Gehlenborg
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Tools used for implementing visualization software systems can generally be divided into camps such as static versus interactive and desktop versus web-based. We contribute Pluot, an architecture that bridges these divides, enabling a single software implementation of a visualization to be used regardless of the target level of interactivity or computing environment. With Pluot, a visualization developer implements a given visualization rendering function once, using the Rust programming language. Then, bindings to the Rust program can be generated to enable reproducible execution of the rendering function from other languages, such as Python or JavaScript. Pluot can render visualizations to bitmap or vector graphics format, bridging gaps between interactive performance and publication-quality figure creation. The software is available at this https URL.

[HC-24] Modeling Bounded Rationality in Drug Shortage Pharmacists Using Attention-Guided Dynamic Decomposition

链接: https://arxiv.org/abs/2605.14111
作者: Yaniv Eliyahu Amiri,Noah Chicoine,Jacqueline Griffin,Stacy Marsella
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at CogSci 2026. 6 pages plus references, 1 figure, 2 tables

点击查看摘要

Abstract:Hospital pharmacists make high-stakes decisions to mitigate drug shortages under uncertainty, time pressure, and patient risk. Interviews revealed that pharmacists focus attention on a small subset of drugs, limiting cognitive effort to the most urgent cases. Motivated by these findings, we formalize a bounded-rational, attention-guided decision framework that dynamically decomposes drugs into a subset for high-cost reasoning and a complementary subset for low-cost monitoring. We develop two agents: an Expert Agent that applies attention weights derived from pharmacist interviews, and a Learner Agent that adapts attention allocation over time through experience. Across simulated scenarios spanning short to long horizons, we show that attention-guided planning supports stable decision-making without complete state reasoning. These results suggest that a primary decision is not what action to take, but where to allocate cognitive effort, and that attention-guided, satisficing strategies can reduce problem complexity while maintaining stable performance.
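The core decomposition idea above, spending high-cost reasoning only on the most attention-worthy items while cheaply monitoring the rest, can be sketched as follows. The drug names, attention weights, and budget `k` are hypothetical placeholders, not the paper's interview-derived weights:

```python
def decompose_by_attention(items, attention, k):
    """Split items into a high-attention subset for costly reasoning
    and a complementary subset for low-cost monitoring.
    `attention` maps item -> urgency weight; `k` is the reasoning budget."""
    ranked = sorted(items, key=lambda it: attention[it], reverse=True)
    return ranked[:k], ranked[k:]

# Hypothetical urgency weights for four drugs under shortage.
attention = {"drug_A": 0.9, "drug_B": 0.2, "drug_C": 0.7, "drug_D": 0.1}
reason, monitor = decompose_by_attention(list(attention), attention, k=2)
print(reason)  # → ['drug_A', 'drug_C']
```

In the paper's framing, the Learner Agent would additionally update `attention` from experience rather than keep it fixed.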

[HC-25] Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task

链接: https://arxiv.org/abs/2605.14097
作者: Aaron Parisi,Nithum Thain,Alden Hallak,Vivian Tsai,Crystal Qian
类目: Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:As large language models (LLMs) evolve from single-user assistants to active participants in civic and workplace deliberation, evaluating their effects on collective decision making becomes a governance challenge. We present two empirical studies (N=879) of real-time, text-based group deliberation in an incentive-compatible charity allocation task with real financial stakes (7,200 USD). Groups of three allocate a donation budget under varying LLM facilitation conditions: Study 1 (N=204) compares three frontier models; Study 2 (N=675) compares facilitator strategies against a no-facilitation baseline. Across both studies, LLM facilitation did not significantly improve group consensus, yet participants consistently preferred facilitated discussion. We additionally identify two governance-relevant risks. First, algorithmic steering: facilitators shifted select charity-level allocations by up to 5.5 percentage points, directly affecting the final charitable payout, even when aggregate agreement metrics remained unchanged. Second, an illusion of inclusion: participants cited inclusivity as their primary reason for preferring LLM facilitators, yet neither survey nor transcript-based measures of participation equity improved. Notably, participants reported greater trust in the process under the same conditions where facilitators exerted directional influence on outcomes. Together, these findings show that in AI-mediated group deliberation, perceived procedural improvement can coexist with measurable steering and unchanged participation inequality, motivating evaluation practices that treat collective outcomes, interaction dynamics, and participant perceptions as distinct governance targets.

[HC-26] Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

链接: https://arxiv.org/abs/2605.13930
作者: William Lehn-Schiøler,Magnus Ruud Kjær,Rahul Thapa,Magnus Guldberg Pedersen,Anton Storgaard Mosquera,Nick Williams,Radu Gatej,Tue Lehn-Schiøler,Sándor Beniczky,Sadasivan Puthusserypady,James Zou,Lars Kai Hansen
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC); Neural and Evolutionary Computing (cs.NE)
备注: Preprint. 14 pages, 7 figures, 4 tables

点击查看摘要

Abstract:EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a “target vs. off-target” probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: “wrecking-ball” interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and \alpha -band restoration.
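The TopK SAE mentioned above constrains each embedding's sparse code to its k largest activations. A minimal numpy sketch of that forward pass follows; dimensions and random weights are arbitrary stand-ins, whereas the paper's SAEs are trained on EEG transformer embeddings:

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a TopK sparse autoencoder:
    encode, keep only the k largest activations per sample, decode."""
    acts = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU pre-codes
    # Zero all but the top-k activations in each row (the TopK constraint).
    idx = np.argsort(acts, axis=1)[:, :-k]      # indices of the d-k smallest
    sparse = acts.copy()
    np.put_along_axis(sparse, idx, 0.0, axis=1)
    recon = sparse @ W_dec + b_dec
    return sparse, recon

rng = np.random.default_rng(0)
d_in, d_dict, k = 8, 32, 4
x = rng.normal(size=(5, d_in))                  # 5 toy "embeddings"
W_enc = rng.normal(size=(d_in, d_dict)) * 0.1
W_dec = rng.normal(size=(d_dict, d_in)) * 0.1
sparse, recon = topk_sae_forward(x, W_enc, np.zeros(d_dict), W_dec, np.zeros(d_in), k)
print((sparse != 0).sum(axis=1))                # at most k active features per row
```

Training would minimize the reconstruction error `||x - recon||^2`; the hard TopK projection replaces the L1 penalty used in earlier SAE variants.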

[HC-27] Large Language Models for Web Accessibility: A Systematic Literature Review

链接: https://arxiv.org/abs/2605.13873
作者: Wajdi Aljedaani,Rubel Hassan Mollik
类目: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at the 23rd International Web for All Conference (W4A 2026)

点击查看摘要

Abstract:Web accessibility aims to ensure that web content and services are usable by people with diverse abilities. In recent years, Large Language Models (LLMs) have been increasingly explored to support accessibility-related tasks on the web, such as content generation, issue detection, and remediation. However, little is known about the characteristics of these approaches, the accessibility issues they target, the standards they follow, and how they are evaluated. In this paper, we present a systematic literature review of 38 peer-reviewed studies that investigate the use of LLMs in web accessibility contexts. We begin by performing a comprehensive search of scientific publications to identify relevant studies. We then conduct a comparative analysis to examine the accessibility tasks addressed, the LLM models and prompting strategies employed, the system architectures adopted, the accessibility issues and guidelines considered, and the evaluation methods used across studies. Our findings show that most studies apply LLMs to text-centric and structurally explicit accessibility tasks, with WCAG serving as the primary reference framework and limited consideration of cognitive accessibility guidelines (COGA). The reviewed approaches predominantly rely on general-purpose LLMs and prompt-based interactions, while evaluation practices vary widely and often lack direct involvement of users with disabilities. We envision this review as a consolidated reference for researchers and practitioners seeking to understand the current landscape of LLM-supported web accessibility, and as a foundation to guide future research and tool development in this area.

[HC-28] nASR: An End-to-End Trainable Neural Layer for Channel-Level EEG Artifact Subspace Reconstruction in Real-Time BCI

链接: https://arxiv.org/abs/2605.14941
作者: Shantanu Sarkar,Jose L. Contreras-Vidal
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Preprint. Submitted to IEEE SMC 2026 (under review)

点击查看摘要

Abstract:Electroencephalogram (EEG) signals are highly susceptible to artifacts, resulting in a low signal-to-noise ratio which makes extraction of meaningful neural information challenging. Artifact Subspace Reconstruction (ASR) is one of the most widely used artifact filtering techniques in EEG-based BCI applications, owing to its real-time applicability. ASR reconstructs artifact-free signals by operating in Principal Component (PC) space within sliding windows. However, ASR performance is critically sensitive to its threshold parameter - an incorrect threshold risks removing task-relevant neural features alongside artifacts. Furthermore, since PCs are linear combinations of all channels, subspace reconstruction in PC space may alter the underlying data structure, potentially discarding essential neural information. To address these limitations, we propose nASR, a novel end-to-end trainable Keras layer that jointly optimizes artifact rejection and downstream decoding. nASR introduces two trainable threshold parameters: K, which governs artifact detection in PC variance space, and L, which quantifies eigen-spread to pinpoint the primary artifact-contributing channels, enabling selective channel-level reconstruction that preserves clean channel information. An ablation study comprising five model variants (m01 - m05), evaluated across two subjects from the BCI Competition IV Dataset 1, confirms that nASR variants consistently outperform traditional ASR on test classification metrics, while achieving a 6-8x reduction in inference time, making nASR a strong candidate for real-time BCI applications demanding both low latency and high decoding performance.
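For readers unfamiliar with the ASR baseline, a simplified single-window sketch of the PC-space thresholding idea follows. This toy version uses a hypothetical median-based threshold on component variance; real ASR calibrates its threshold on clean reference data, and the paper's nASR makes such parameters trainable:

```python
import numpy as np

def asr_reconstruct(window, k):
    """Toy Artifact Subspace Reconstruction for one EEG window
    (channels x samples): project into PC space via SVD, zero components
    whose std exceeds k times the median component std, project back."""
    mean = window.mean(axis=1, keepdims=True)
    U, S, Vt = np.linalg.svd(window - mean, full_matrices=False)
    comp_std = S / np.sqrt(window.shape[1])
    keep = comp_std <= k * np.median(comp_std)   # flag the artifact subspace
    return (U * (S * keep)) @ Vt + mean

rng = np.random.default_rng(0)
clean = rng.normal(size=(8, 500))                # 8 channels, 500 samples
artifact = clean.copy()
artifact[0] += 40 * np.sin(np.linspace(0, 20, 500))  # large blink-like burst
restored = asr_reconstruct(artifact, k=5.0)
print(np.abs(restored).max() < np.abs(artifact).max())  # → True
```

The paper's channel-level objection is visible here: zeroing a whole PC touches every channel's reconstruction, even channels that were clean, which is what nASR's L parameter is designed to avoid.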

[HC-29] BCI-Based Assessment of Ocular Response Time Using Dynamic Time Warping Leveraging an RDWT-Driven Deep Neural Framework

链接: https://arxiv.org/abs/2605.14883
作者: Shantanu Sarkar,Sai Shashank Gandavarapu,Jeff Feng,Saurabh Prasad,Reza Khanbabaie,Jose L. Contreras-Vidal
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Submitted to IEEE SMC 2026 (under review)

点击查看摘要

Abstract:Mild traumatic brain injury (mTBI) is a prevalent condition that remains difficult to diagnose in its early stages. Oculomotor dysfunction is a well-established marker of mTBI, motivating the development of portable tools that capture both eye-movement behavior and underlying neurophysiology. In this work, we present an initial framework that integrates electroencephalogram (EEG) with augmented-reality (AR)-based Vestibular/Ocular Motor Screening (VOMS) tasks to estimate subject-specific ocular response times. Pre-processed EEG signals, obtained through band-pass filtering and average referencing, are analyzed using a Redundant Discrete Wavelet Transform (RDWT)-driven deep neural framework. The RDWT coefficients are subjected to trainable zero-phase convolutional filtering and reconstructed into the time domain via inverse RDWT, followed by channel-wise temporal and spatial filtering using 2D convolution layers and convolutional-LSTM-based decoding. An ablation study demonstrates that wavelet-domain filtering serves as an effective denoising strategy, improving prediction performance. Sliding-window predictions were validated using Pearson correlation (= 0.5), and Dynamic Time Warping (DTW) was subsequently used to estimate ocular response times. DTW-derived metrics revealed significant inter-subject differences across all VOM tasks, supported by Mann-Whitney U tests. Cross-correlation analysis further revealed task-dependent temporal behaviors: pursuit tasks exhibited reactive tracking, whereas saccades showed anticipatory responses. Overall, the results highlight pursuit tasks as particularly informative for distinguishing timing differences and demonstrate the potential of RDWT-based EEG features combined with DTW metrics for multimodal mTBI assessment.
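The DTW metric used above aligns two time series via the standard dynamic program. A minimal sketch with toy 1-D sequences (not the paper's EEG-derived signals) shows how a lagged response still aligns closely:

```python
import math

def dtw_distance(s, t):
    """Dynamic Time Warping distance between two 1-D sequences,
    via the standard O(len(s) * len(t)) dynamic program."""
    n, m = len(s), len(t)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

# A delayed copy of a ramp aligns with zero DTW cost despite the lag.
ref = [0, 1, 2, 3, 4, 4, 4]
lagged = [0, 0, 0, 1, 2, 3, 4]
print(dtw_distance(ref, lagged))  # → 0.0
```

The warping path itself (recoverable by backtracking through `D`) is what yields lag-style timing estimates of the kind the paper derives between stimulus and ocular response.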

计算机视觉

[CV-0] EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation

链接: https://arxiv.org/abs/2605.15199
作者: Ruozhen He,Meng Wei,Ziyan Yang,Vicente Ordonez
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring. As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen’s d = +2.33) and presence among methods evaluated. Code and data are available at this https URL.

[CV-1] RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

链接: https://arxiv.org/abs/2605.15196
作者: Xiang Fan,Yuheng Wang,Bohan Fang,Zhongzheng Ren,Ranjay Krishna
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder by injecting high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into the detail-rich high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
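The reference-attention idea, decoder latent tokens querying detail-rich reference-frame tokens, can be sketched as single-head cross-attention with a residual connection. Shapes and random weights below are purely illustrative assumptions, not the RefDecoder architecture:

```python
import numpy as np

def reference_attention(latent_tokens, ref_tokens, Wq, Wk, Wv):
    """Single-head cross-attention: decoder latent tokens (queries) attend
    to reference-image tokens (keys/values), so decoded frames can pull
    high-fidelity detail from the reference frame."""
    Q = latent_tokens @ Wq
    K = ref_tokens @ Wk
    V = ref_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over ref tokens
    # Residual injection of reference detail into the decoder stream.
    return latent_tokens + weights @ V

rng = np.random.default_rng(1)
d = 16
latent = rng.normal(size=(64, d))    # tokens of the denoised video latent
ref = rng.normal(size=(256, d))      # detail-rich reference-frame tokens
out = reference_attention(latent, ref, *(rng.normal(size=(d, d)) * 0.1 for _ in range(3)))
print(out.shape)
```

In the paper, such a block would sit at each decoder up-sampling stage, with a lightweight image encoder producing the reference tokens.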

[CV-2] VGGT-Ω CVPR2026

链接: https://arxiv.org/abs/2605.15195
作者: Jianyuan Wang,Minghao Chen,Shangzhan Zhang,Nikita Karaev,Johannes Schönberger,Patrick Labatut,Piotr Bojanowski,David Novotny,Andrea Vedaldi,Christian Rupprecht
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 (Oral)

点击查看摘要

Abstract:Recent feed-forward reconstruction models, such as VGGT, have proven competitive with traditional optimization-based reconstructors while also providing geometry-aware features useful for other tasks. Here, we show that the quality of these models scales predictably with model and data size. We do so by introducing VGGT-Ω, which substantially improves reconstruction accuracy, efficiency, and capabilities for both static and dynamic scenes. To enable training this model at an unprecedented scale, we introduce architectural changes that improve training efficiency, a high-quality data annotation pipeline that supports dynamic scenes, and a self-supervised learning protocol. We simplify VGGT’s architecture by using a single dense prediction head with multi-task supervision and removing the expensive high-resolution convolutional layers. We also use registers to aggregate scene information into a compact representation and introduce register attention, which restricts inter-frame information exchange to these registers, in part replacing global attention. In this way, during training, VGGT-Ω uses only about 30% of the GPU memory of its predecessor, allowing us to train with 15x more supervised data than prior work and to leverage vast amounts of unlabeled video data. VGGT-Ω achieves strong results for reconstruction of static and dynamic scenes across multiple benchmarks, for example, improving over the previous best camera estimation accuracy on Sintel by 77%. We also show that the learned registers can improve vision-language-action models and support alignment with language, suggesting that reconstruction can be a powerful and scalable proxy task for spatial understanding. Project Page: this http URL

[CV-3] Aligning Latent Geometry for Spherical Flow Matching in Image Generation

链接: https://arxiv.org/abs/2605.15193
作者: Tuna Han Salih Meral,Kaan Oktay,Hidir Yesiltepe,Adil Kaan Akan,Pinar Yanardag
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Latent flow matching for image generation usually transports Gaussian noise to variational autoencoder latents along linear paths. Both endpoints, however, concentrate in thin spherical shells, and a Euclidean chord leaves those shells even when preprocessing aligns their radii. By decomposing each latent token into radial and angular components, we show through component-swap probes that decoded perceptual and semantic content is carried predominantly by direction, with radius contributing much less. We therefore project data latents onto a fixed token radius, use the radial projection of Gaussian noise as the spherical prior, finetune the decoder with the encoder frozen, and replace linear interpolation with spherical linear interpolation. The resulting geodesic paths stay on the sphere at every timestep, and their velocity targets are purely angular by construction. Under matched training, the method consistently improves class-conditional ImageNet-256 FID across different image tokenizers, leaves the diffusion architecture unchanged, and requires no auxiliary encoder or representation-alignment objective.
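The geodesic interpolation at the heart of the method is standard spherical linear interpolation (slerp) between same-radius latents. A minimal sketch shows why the geodesic path stays on the sphere while the Euclidean chord falls inside the shell:

```python
import numpy as np

def slerp(x0, x1, t):
    """Spherical linear interpolation between two same-radius latent tokens:
    the path stays on the sphere at every t, unlike a Euclidean chord."""
    r = np.linalg.norm(x0)
    u0, u1 = x0 / r, x1 / np.linalg.norm(x1)
    theta = np.arccos(np.clip(u0 @ u1, -1.0, 1.0))
    if theta < 1e-6:                     # nearly parallel: fall back to lerp
        return (1 - t) * x0 + t * x1
    return r * (np.sin((1 - t) * theta) * u0 + np.sin(t * theta) * u1) / np.sin(theta)

# Midpoint of the geodesic between two radius-2 tokens keeps radius 2,
# whereas the Euclidean chord midpoint shrinks toward the origin.
x0 = np.array([2.0, 0.0]); x1 = np.array([0.0, 2.0])
mid = slerp(x0, x1, 0.5)
print(round(np.linalg.norm(mid), 6), round(np.linalg.norm(0.5 * (x0 + x1)), 6))
```

In the paper's setting, this interpolation is applied per token between the radially projected Gaussian prior and the radius-normalized data latent, so the flow's velocity targets become purely angular.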

[CV-4] RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

链接: https://arxiv.org/abs/2605.15190
作者: Yanzuo Lu,Ronglai Zuo,Jiankang Deng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Causal autoregressive video diffusion models support real-time streaming generation by extrapolating future chunks from previously generated content. Distilling such generators from high-fidelity bidirectional teachers yields competitive few-step models, yet a persistent gap between the history distributions encountered during training and those arising at inference constrains generation quality over long horizons. We introduce the Real-time Autoregressive Video Extrapolation Network (RAVEN), a training-time test framework that repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states. This formulation aligns training attention with inference-time extrapolation and allows downstream chunk losses to supervise the history representations on which future predictions depend. We further propose Consistency-model Group Relative Policy Optimization (CM-GRPO), which reformulates a consistency sampling step as a conditional Gaussian transition and applies online Reinforcement Learning (RL) directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations. Experiments demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations, and that CM-GRPO provides further gains when combined with RAVEN.

[CV-5] Articraft: An Agentic System for Scalable Articulated 3D Asset Generation ICRA

链接: https://arxiv.org/abs/2605.15187
作者: Matt Zhou,Ruining Li,Xiaoyang Lyu,Zhaomou Song,Zhening Huang,Chuanxia Zheng,Christian Rupprecht,Andrea Vedaldi,Shangzhe Wu
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Robotics (cs.RO)
备注: Project page: this https URL

点击查看摘要

Abstract:A bottleneck in learning to understand articulated 3D objects is the lack of large and diverse datasets. In this paper, we propose to leverage large language models (LLMs) to close this gap and generate articulated assets at scale. We reduce the problem of generating an articulated 3D asset to that of writing a program that builds it. We then introduce a new agentic system, Articraft, that writes such programs automatically. We design a programmatic interface and harness to help the LLM do so effectively. The LLM writes code against a domain-specific SDK for defining parts, composing geometry, specifying joints, and writing tests to validate the resulting assets. The harness exposes a restricted workspace and interface to the LLM, validates the resulting assets, and returns structured feedback. In this way, the LLM is not distracted by details such as authoring a URDF file or managing a complex software environment. We show that this produces higher-quality assets than both state-of-the-art articulated-asset generators and general-purpose coding agents. Using Articraft, we build Articraft-10K, a curated dataset of over 10K articulated assets spanning 245 categories, and show its utility both for training models of articulated assets and in downstream applications such as robotics simulation and virtual reality.

[CV-6] VGGT-Edit: Feed-forward Native 3D Scene Editing with Residual Field Prediction

链接: https://arxiv.org/abs/2605.15186
作者: Kaixin Zhu,Yiwen Tang,Yifan Yang,Renrui Zhang,Bohan Zeng,Ziyu Guo,Ruichuan An,Zhou Liu,Qizhi Chen,Delin Qu,Jaehong Yoon,Wentao Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone’s spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed.

[CV-7] Quantitative Video World Model Evaluation for Geometric-Consistency

链接: https://arxiv.org/abs/2605.15185
作者: Jiaxin Wu,Yihao Pi,Yinling Zhang,Yuheng Li,Xueyan Zou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures. Project page : this https URL

点击查看摘要

Abstract:Generative video models are increasingly studied as implicit world models, yet evaluating whether they produce physically plausible 3D structure and motion remains challenging. Most existing video evaluation pipelines rely heavily on human judgment or learned graders, which can be subjective and weakly diagnostic for geometric failures. We introduce PDI-Bench (Perspective Distortion Index), a quantitative framework for auditing geometric coherence in generated videos. Given a generated clip, we obtain object-centric observations via segmentation and point tracking (e.g., SAM 2, MegaSaM, and CoTracker3), lift them to 3D world-space coordinates via monocular reconstruction, and compute a set of projective-geometry residuals capturing three failure dimensions: scale-depth alignment, 3D motion consistency, and 3D structural rigidity. To support systematic evaluation, we build PDI-Dataset, covering diverse scenarios designed to stress these geometric constraints. Across state-of-the-art video generators, PDI reveals consistent geometry-specific failure modes that are not captured by common perceptual metrics, and provides a diagnostic signal for progress toward physically grounded video generation and physical world models. Our code and dataset can be found at this https URL.
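The abstract names three residual dimensions (scale-depth alignment, 3D motion consistency, 3D structural rigidity) without giving their formulas here. As a hedged illustration of a rigidity-style residual, not PDI-Bench's actual definition: for lifted 3D tracks, a rigid point set preserves pairwise distances over time, so the spread of those distances can serve as a residual.

```python
import math

def rigidity_residual(tracks):
    """tracks: list of frames, each a list of (x, y, z) 3D points in the
    same point order. Returns the mean standard deviation of pairwise
    inter-point distances over time; zero for a perfectly rigid set.
    (A hypothetical residual in the spirit of the paper's rigidity check.)"""
    n = len(tracks[0])
    residuals = []
    for i in range(n):
        for j in range(i + 1, n):
            dists = [math.dist(frame[i], frame[j]) for frame in tracks]
            mean = sum(dists) / len(dists)
            var = sum((d - mean) ** 2 for d in dists) / len(dists)
            residuals.append(math.sqrt(var))
    return sum(residuals) / len(residuals)

# A rigid translation leaves all pairwise distances unchanged.
rigid = [[(0, 0, 1), (1, 0, 1), (0, 1, 1)],
         [(5, 0, 1), (6, 0, 1), (5, 1, 1)]]
print(rigidity_residual(rigid))  # 0.0
```

A non-rigid deformation (points moving relative to each other) yields a strictly positive residual.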

[CV-8] Warp-as-History: Generalizable Camera-Controlled Video Generation from One Training Video

链接: https://arxiv.org/abs/2605.15182
作者: Yifan Wang,Tong He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

Abstract:Camera-controlled video generation has made substantial progress, enabling generated videos to follow prescribed viewpoint trajectories. However, existing methods usually learn camera-specific conditioning through camera encoders, control branches, or attention and positional-encoding modifications, which often require post-training on large-scale camera-annotated videos. Training-free alternatives avoid such post-training, but often shift the cost to test-time optimization or extra denoising-time guidance. We propose Warp-as-History, a simple interface that turns camera-induced warps into camera-warped pseudo-history with target-frame positional alignment and visible-token selection. Given a target camera trajectory, we construct camera-warped pseudo-history from past observations and feed it through the model’s visual-history pathway. Crucially, we align its positional encoding with the target frames being denoised and remove warped-history tokens without valid source observations. Without any training, architectural modification, or test-time optimization, this interface reveals a non-trivial zero-shot capability of a frozen video generation model to follow camera trajectories. Moreover, lightweight offline LoRA finetuning on only one camera-annotated video further improves this capability and generalizes to unseen videos, improving camera adherence, visual quality, and motion dynamics without test-time optimization or target-video adaptation. Extensive experiments on diverse datasets confirm the effectiveness of our method.
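The camera-induced warp that the pseudo-history is built from is, at its core, standard pinhole reprojection. A minimal single-pixel sketch, assuming known depth and intrinsics shared by both views (the paper operates on full frames and handles tokens without valid source observations separately):

```python
def warp_pixel(u, v, depth, K, R, t):
    """Reproject pixel (u, v) with known depth from a source camera into a
    target camera with relative rotation R (3x3) and translation t (3,).
    K is the shared 3x3 intrinsic matrix."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    # Back-project to a 3D point in the source camera frame.
    P = (depth * (u - cx) / fx, depth * (v - cy) / fy, depth)
    # Transform into the target camera frame: P' = R @ P + t.
    Pw = [sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3)]
    # Project back to pixel coordinates.
    return (fx * Pw[0] / Pw[2] + cx, fy * Pw[1] / Pw[2] + cy)

K = [[500, 0, 320], [0, 500, 240], [0, 0, 1]]
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Identity pose maps a pixel onto itself (up to floating-point error).
print(warp_pixel(100, 80, 2.0, K, I, [0, 0, 0]))  # approximately (100.0, 80.0)
```

Warping every source pixel this way, given a target camera trajectory, produces the camera-warped pseudo-history the method feeds through the model's visual-history pathway.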

[CV-9] From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing

链接: https://arxiv.org/abs/2605.15181
作者: Anirudh Sundara Rajan,Krishna Kumar Singh,Yong Jae Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modern image editing models produce realistic results but struggle with abstract, multi-step instructions (e.g., "make this advertisement more vegetarian-friendly"). Prior agent-based methods decompose such tasks but rely on handcrafted pipelines or teacher imitation, limiting flexibility and decoupling learning from actual editing outcomes. We propose an experiential framework for long-horizon image editing, where a planner generates structured atomic decompositions and an orchestrator selects tools and regions to execute each step. A vision-language judge provides outcome-based rewards for instruction adherence and visual quality. The orchestrator is trained to maximize these rewards, and successful trajectories are used to refine the planner. By tightly coupling planning with reward-driven execution, our approach yields more coherent and reliable edits than single-step or rule-based multi-step baselines.

[CV-10] SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer

链接: https://arxiv.org/abs/2605.15178
作者: Haoyi Zhu,Haozhe Liu,Yuyang Zhao,Tian Ye,Junsong Chen,Jincheng Yu,Tong He,Song Han,Enze Xie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL

点击查看摘要

Abstract:We introduce SANA-WM, an efficient 2.6B-parameter open-source world model natively trained for one-minute generation, synthesizing high-fidelity, 720p, minute-scale videos with precise camera control. SANA-WM achieves visual quality comparable to large-scale industrial baselines such as LingBot-World and HY-WorldPlay, while significantly improving efficiency. Four core designs drive our architecture: (1) Hybrid Linear Attention combines frame-wise Gated DeltaNet (GDN) with softmax attention for memory-efficient long-context modeling. (2) Dual-Branch Camera Control ensures precise 6-DoF trajectory adherence. (3) Two-Stage Generation Pipeline applies a long-video refiner to stage-1 outputs, improving quality and consistency across sequences. (4) Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos to yield high-quality, spatiotemporally consistent action labels. Driven by these designs, SANA-WM demonstrates remarkable efficiency across data, training compute, and inference hardware: it uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60s clip on a single GPU; its distilled variant can be deployed on a single RTX 5090 with NVFP4 quantization to denoise a 60s 720p clip in 34s. On our one-minute world-model benchmark, SANA-WM demonstrates stronger action-following accuracy than prior open-source baselines and achieves comparable visual quality at 36× higher throughput for scalable world modeling.

[CV-11] Evidential Reasoning Advances Interpretable Real-World Disease Screening ICML2026

链接: https://arxiv.org/abs/2605.15171
作者: Chenyu Lian,Hong-Yu Zhou,Jing Qin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICML 2026

点击查看摘要

Abstract:Disease screening is critical for early detection and timely intervention in clinical practice. However, most current screening models for medical images suffer from limited interpretability and suboptimal performance. They often lack effective mechanisms to reference historical cases or provide transparent reasoning pathways. To address these challenges, we introduce EviScreen, an evidential reasoning framework for disease screening that leverages region-level evidence from historical cases. The proposed EviScreen offers retrospection interpretability through regional evidence retrieved from dual knowledge banks. Using this evidential mechanism, the subsequent evidence-aware reasoning module makes predictions using both the current case and evidence from historical cases, thereby enhancing disease screening performance. Furthermore, rather than relying on post-hoc saliency maps, EviScreen enhances localization interpretability by leveraging abnormality maps derived from contrastive retrieval. Our method achieves superior performance on our carefully established benchmarks for real-world disease screening, yielding notably higher specificity at clinical-level recall. Code is publicly available at this https URL.

[CV-12] Does Synthetic Layered Design Data Benefit Layered Design Decomposition?

链接: https://arxiv.org/abs/2605.15167
作者: Kam Man Wu,Haolin Yang,Qingyu Chen,Yihu Tang,Jingye Chen,Qifeng Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 10 figures. Code is available at this https URL

点击查看摘要

Abstract:Recent advances in image generation have made it easy to produce high-quality images. However, these outputs are inherently flattened, entangling foreground elements, background, and text within a fixed canvas. As a result, flexible post-generation editing remains challenging, revealing a clear last-mile gap toward practical usability. Existing approaches either rely on scarce proprietary layered assets or construct partially synthetic data from limited structural priors. However, both strategies face fundamental challenges in scalability. In this work, we investigate whether purely synthetic layered data can improve graphic design decomposition. We make the assumption that, in graphic design, effective decomposition does not require modeling inter-layer dependencies as precisely as in natural-image composition, since design elements are often intentionally arranged as modular and semantically separable components. Concretely, we conduct a data-centric study based on the CLD baseline, a state-of-the-art layer decomposition framework. On top of this baseline, we construct our own synthetic dataset, SynLayers, generate textual supervision using vision language models, and automate inference inputs with VLM-predicted bounding boxes. Our study reveals three key findings: (1) even training with purely synthetic data can outperform non-scalable alternatives such as the widely used PrismLayersPro dataset, demonstrating its viability as a scalable and effective substitute; (2) performance consistently improves with increased training data scale, while gains begin to saturate at around 50K samples; and (3) synthetic data enables balanced control over layer-count distributions, avoiding the layer-count imbalance commonly observed in real-world datasets. We hope this data-centric study encourages broader adoption of synthetic data as a practical foundation for layered design editing systems.

[CV-13] Causal Forcing: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

链接: https://arxiv.org/abs/2605.15141
作者: Min Zhao,Hongzhou Zhu,Kaiwen Zheng,Zihan Zhou,Bokai Yan,Xinyuan Li,Xiao Yang,Chongxuan Li,Jun Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1–2 sampling steps. In this regime, we identify the initialization of a few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. The resulting pipeline, Causal Forcing++, surpasses the SOTA 4-step chunk-wise Causal Forcing under the frame-wise 2-step setting by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by ~4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project Page: this https URL and this https URL .

[CV-14] CLOVER: Closed-Loop Value Estimation Ranking for End-to-End Autonomous Driving Planning

链接: https://arxiv.org/abs/2605.15120
作者: Sining Ang,Yuguang Yang,Canyu Chen,Yan Wang
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:End-to-end autonomous driving planners are commonly trained by imitating a single logged trajectory, yet evaluated by rule-based planning metrics that measure safety, feasibility, progress, and comfort. This creates a training–evaluation mismatch: trajectories close to the logged path may violate planning rules, while alternatives farther from the demonstration can remain valid and high-scoring. The mismatch is especially limiting for proposal-selection planners, whose performance depends on candidate-set coverage and scorer ranking quality. We propose CLOVER, a Closed-LOop Value Estimation and Ranking framework for end-to-end autonomous driving planning. CLOVER follows a lightweight generator–scorer formulation: a generator produces diverse candidate trajectories, and a scorer predicts planning-metric sub-scores to rank them at inference time. To expand proposal support beyond single-trajectory imitation, CLOVER constructs evaluator-filtered pseudo-expert trajectories and trains the generator with set-level coverage supervision. It then performs conservative closed-loop self-distillation: the scorer is fitted to true evaluator sub-scores on generated proposals, while the generator is refined toward teacher-selected top- k and vector-Pareto targets with stability regularization. We analyze when an imperfect scorer can improve the generator, showing that scorer-mediated refinement is reliable when scorer-selected targets are enriched under the true evaluator and updates remain conservative. On NAVSIM, CLOVER achieves 94.5 PDMS and 90.4 EPDMS, establishing a new state of the art. On the more challenging NavHard split, it obtains 48.3 EPDMS, matching the strongest reported result. On supplementary nuScenes open-loop evaluation, CLOVER achieves the lowest L2 error and collision rate among compared methods. Code and data will be released at this https URL.
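The inference-time step of the generator–scorer formulation (scorer predicts sub-scores, candidates are ranked and the best is selected) can be sketched as follows; the linear weighting used here is a hypothetical stand-in for the evaluator's actual aggregation rule, which the abstract does not spell out:

```python
def rank_proposals(proposals, weights):
    """proposals: list of (trajectory_id, sub_scores), where sub_scores is a
    dict of predicted planning sub-metrics (e.g. safety, progress, comfort).
    Combines sub-scores with the given weights and returns proposals sorted
    best-first; taking the first element mimics inference-time selection."""
    def combined(item):
        _, scores = item
        return sum(weights[k] * scores[k] for k in weights)
    return sorted(proposals, key=combined, reverse=True)

candidates = [
    ("traj_a", {"safety": 0.9, "progress": 0.4, "comfort": 0.8}),
    ("traj_b", {"safety": 0.7, "progress": 0.9, "comfort": 0.6}),
]
w = {"safety": 0.5, "progress": 0.3, "comfort": 0.2}
best = rank_proposals(candidates, w)[0][0]
print(best)  # traj_b
```

Because selection depends only on the ranking, the scorer need not be perfectly calibrated, which is the regime the paper's analysis of "imperfect scorers" addresses.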

[CV-15] DriveCtrl: Conditioned Sim-to-Real Driving Video Generation

链接: https://arxiv.org/abs/2605.15116
作者: Haonan Zhao,Yiting Wang,Jingkun Chen,Valentina Donzella,Thomas Bashford-Rogers,Kurt Debattista
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale labelled driving video data is essential for training autonomous driving systems. Although simulation offers scalable and fully annotated data, the domain gap between synthetic and real-world driving videos significantly limits its utility for downstream deployment. Existing video generation methods are not well-suited for this task, as they fail to simultaneously preserve scene structure, object dynamics, temporal consistency, and visual realism, all of which are critical for maintaining annotation validity in generated data. In this paper, we present DriveCtrl, a depth-conditioned controllable sim-to-real video generation framework for realistic driving video synthesis. Built upon a pretrained video foundation model, DriveCtrl introduces a structure-aware adapter that enables depth-guided generation while preserving the scene layout and motion patterns of the source simulation, producing temporally coherent driving videos that remain aligned with the original simulated sequences. We further introduce a scalable data generation pipeline that transforms simulator videos into realistic driving footage matching the visual style of a target real-world dataset. The pipeline supports three conditioning signals: structural depth, reference-dataset style, and text prompts, while preserving frame-level annotations for downstream perception tasks. To better assess this task, we propose a driving-domain-specific knowledge-informed evaluation metric called Driving Video Realism Score (DVRS) that assesses the realism of generated videos. Experiments demonstrate that DriveCtrl consistently outperforms the base model and competing alternatives in realism, temporal quality, and perception task performance, substantially narrowing the sim-to-real gap for driving video generation.

[CV-16] CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites

链接: https://arxiv.org/abs/2605.15093
作者: Jess Jones,Leonardo Bertini,Kenneth Johnson,Erica Hendy,Tilo Burghardt
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures, 2 tables

点击查看摘要

Abstract:The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive Porites sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated μCT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled μCT virtual slabs of Porites sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from μCT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 μCT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.
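For reference, the Dice scores reported above follow the standard segmentation-overlap definition, 2|A ∩ B| / (|A| + |B|); a minimal sketch for binary masks given as coordinate sets:

```python
def dice(pred, truth):
    """Dice coefficient between two binary masks represented as sets of
    (row, col) pixel coordinates: 2|A ∩ B| / (|A| + |B|)."""
    if not pred and not truth:
        return 1.0  # both empty: perfect agreement by convention
    return 2 * len(pred & truth) / (len(pred) + len(truth))

a = {(0, 0), (0, 1), (1, 0), (1, 1)}
b = {(0, 1), (1, 1), (2, 1), (2, 2)}
print(dice(a, b))  # 0.5
```

A Dice of 0.77 thus indicates substantially more overlap than a random baseline, while remaining short of pixel-perfect agreement (1.0).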

[CV-17] SAGE3D: Soft-guided attention and graph excitation for 3D point cloud corner detection

链接: https://arxiv.org/abs/2605.15088
作者: Batuhan Arda Bekar,Can Sarı,Hüseyin Can Gülkan,Barış Özcan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:We present SAGE3D, a hybrid Transformer-based model for corner detection in airborne LiDAR point clouds. We propose a multi-stage solution built on a hierarchical encoder-decoder architecture that progressively downsamples point clouds through Set Abstraction layers and recovers per-point predictions via Feature Propagation. We introduce two innovations: Soft-Guided Attention, which injects ground-truth corner labels as a log-prior into attention logits during training to improve precision; and an Excitatory Graph Neural Network positioned at strategic resolutions in the hierarchy, employing positive-only message passing where high-confidence corners reinforce predictions through learned boosting, optimizing for recall. The hierarchical design enables multi-scale feature extraction while our guided attention and excitatory modules ensure corner signals are amplified rather than diluted across scales.
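The Soft-Guided Attention idea, ground-truth labels entering the attention logits as a log-prior before the softmax, can be sketched as below; `beta` and `eps` are hypothetical knobs, since the abstract only states that labels enter as a log-prior, not the exact parameterization:

```python
import math

def soft_guided_attention(logits, prior, beta=1.0, eps=1e-6):
    """Adds a log-prior to raw attention logits before the softmax, so
    high-prior positions (e.g. ground-truth corner points during training)
    receive more attention mass. At inference, prior can be uniform."""
    guided = [l + beta * math.log(p + eps) for l, p in zip(logits, prior)]
    m = max(guided)  # subtract max for numerical stability
    exps = [math.exp(g - m) for g in guided]
    s = sum(exps)
    return [e / s for e in exps]

# Equal raw logits, but position 1 is a labeled corner: it dominates.
weights = soft_guided_attention([0.0, 0.0, 0.0], [0.01, 0.9, 0.01])
print(weights)
```

Because the prior enters additively in log-space, it biases rather than overrides the learned logits: a strong enough data-driven logit can still outweigh a weak prior.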

[CV-18] Computational Imaging Priors for Wireless Capsule Endoscopy: Monte Carlo-Guided Hemoglobin Mapping for Rare-Anomaly Detection

链接: https://arxiv.org/abs/2605.15062
作者: Chengshuai Yang,Lei Xing,Gregory Entin,Roopa Vemulapalli,Lisa Casey,Raiyan Tripti Zaman
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 6 figures, 3 tables. Code and trained-model checkpoints at this https URL . 6-seed (seeds 41, 42, 43, 44, 45, 47) mean +/- SD ablation as the headline; per-class single-seed=42 analyses in Appendix A

点击查看摘要

Abstract:Background. RGB-trained capsule-endoscopy classifiers underperform on small-vessel vascular findings by conflating hemoglobin contrast with bile and illumination falloff. Thus, here we test whether a Monte Carlo-inspired analytic model can compute hemoglobin from RGB signal built upon extracted classifier. Methods. On Kvasir-Capsule (47,238 frames, video-level 70/15/15 split, 11 evaluable classes) we evaluate two software-only configurations against RGB-only EfficientNet-B0 across 6 seeds: (i) a prior P_blood = sigma(alpha * (H_norm - 0.5)) * Phi(r) fused as 2 zero-init auxiliary channels; (ii) a distillation head training a 3-channel RGB backbone to predict P_blood. Significance: paired DeLong, McNemar, bootstrap CIs with Bonferroni correction. Results. Across 6 seeds (n=6,423), the analytic prior provides a small but direction-consistent macro-AUC improvement: RGB-only 0.760 +/- 0.027, input-fusion 0.783 +/- 0.024 (paired Delta = +0.023, sign-positive on 5/6 seeds), distillation 0.773 +/- 0.028. The largest robust per-class lift is on Lymphangiectasia, where AUC rises from RGB 0.238 +/- 0.057 to input-fusion 0.337 +/- 0.019, sign-consistent across all 6 seeds. On rare focal-vascular classes (Angiectasia, Blood - fresh) the prior's per-seed effects are bimodal: seed=42 reaches Angiectasia AUC 0.528 → 0.916, but the cross-seed mean is 0.646 → 0.608 with sigma_PI = 0.23, reported as a high-variance per-seed exemplar. Conclusion. A Monte Carlo-inspired analytic prior provides a small, direction-consistent macro-AUC improvement on Kvasir-Capsule across 6 seeds with the largest robust per-class lift on Lymphangiectasia; the distillation variant runs on plain 3-channel RGB and yields a free interpretability heatmap.
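The analytic prior quoted in the abstract can be sketched directly. The abstract does not give the form of the falloff term Phi, so a Gaussian illumination-falloff weight is assumed here, and the `alpha` / `sigma_r` values are hypothetical:

```python
import math

def blood_prior(h_norm, r, alpha=8.0, sigma_r=0.5):
    """P_blood = sigmoid(alpha * (H_norm - 0.5)) * Phi(r), the prior form
    quoted in the abstract. h_norm is a normalized hemoglobin-contrast
    estimate in [0, 1] and r a normalized radial image coordinate.
    Phi is modeled as a Gaussian falloff (assumption, not from the paper)."""
    sig = 1.0 / (1.0 + math.exp(-alpha * (h_norm - 0.5)))
    phi = math.exp(-(r ** 2) / (2 * sigma_r ** 2))  # assumed falloff weight
    return sig * phi

# Stronger hemoglobin contrast at the image center -> higher prior.
print(blood_prior(0.9, 0.0) > blood_prior(0.3, 0.0))  # True
```

The sigmoid term saturates toward 0 or 1 around the H_norm = 0.5 midpoint, while the radial term downweights the dim image periphery, which is how the prior separates hemoglobin contrast from illumination falloff.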

[CV-19] DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

链接: https://arxiv.org/abs/2605.15055
作者: Quanhao Li,Junqiu Yu,Kaixun Jiang,Yujie Wei,Zhen Xing,Pandeng Li,Ruihang Chu,Shiwei Zhang,Yu Liu,Zuxuan Wu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Reinforcement learning has emerged as a powerful tool for improving diffusion-based text-to-image models, but existing methods are largely limited to single-task optimization. Extending RL to multiple tasks is challenging: joint optimization suffers from cross-task interference and imbalance, while cascade RL is cumbersome and prone to catastrophic forgetting. We propose DiffusionOPD, a new multi-task training paradigm for diffusion models based on Online Policy Distillation (OPD). DiffusionOPD first trains task-specific teachers independently, then distills their capabilities into a unified student along the student's own rollout trajectories. This decouples single-task exploration from multi-task integration and avoids the optimization burden of solving all tasks jointly from scratch. Theoretically, we lift the OPD framework from discrete tokens to continuous-state Markov processes, deriving a closed-form per-step KL objective that unifies both stochastic SDE and deterministic ODE refinement via mean-matching. We formally and empirically demonstrate that this analytic gradient provides lower variance and better generality compared to conventional PPO-style policy gradients. Extensive experiments show that DiffusionOPD consistently surpasses both multi-reward RL and cascade RL baselines in training efficiency and final performance, while achieving state-of-the-art results on all evaluated benchmarks.
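The "mean-matching" reading of the per-step KL follows from a standard Gaussian identity: for two Gaussians with the same covariance sigma^2 I, the KL divergence reduces to a scaled squared difference of means, so minimizing it amounts to matching the student's predicted mean to the teacher's. The sketch below shows that identity, not the paper's exact objective:

```python
def per_step_kl(mu_student, mu_teacher, sigma):
    """KL( N(mu_s, sigma^2 I) || N(mu_t, sigma^2 I) )
    = ||mu_s - mu_t||^2 / (2 * sigma^2).
    With shared per-step covariance, minimizing KL == matching means."""
    sq = sum((a - b) ** 2 for a, b in zip(mu_student, mu_teacher))
    return sq / (2 * sigma ** 2)

print(per_step_kl([1.0, 2.0], [1.0, 2.0], 0.5))  # 0.0
print(per_step_kl([1.0, 0.0], [0.0, 0.0], 1.0))  # 0.5
```

Because this objective is a smooth quadratic in the student's prediction, its gradient is analytic, which is the low-variance alternative to PPO-style policy gradients the abstract refers to.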

[CV-20] LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection

链接: https://arxiv.org/abs/2605.15054
作者: Mitchell Piehl,Muchao Ye
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-language models (VLMs) have recently emerged as a promising paradigm for video anomaly detection (VAD) due to their strong visual reasoning ability and natural language-based explainability. In this paper, we aim to address a key limitation of such pipelines, which perform segment-level inference independently owing to token constraints and reason without structured temporal context, allowing VLMs to interpret anomalies as deviations from evolving video dynamics rather than producing fragmented predictions and explanations. To specify, we propose a context-aware framework named LATERN, which reformulates VAD as a temporal evidence aggregation process. LATERN consists of two complementary modules: Context-Aware Anomaly Scoring (CEA) and Recursive Evidence Aggregation (REA). CEA introduces a novel image-grounded memory mechanism, which selectively chooses historical content via frame diversity and visual-textual alignment as expanded context to help generate reliable anomaly scores. Building upon these scores, REA performs recursive temporal aggregation to identify coherent anomaly intervals and produce event-level decisions and explanations grounded in visual-textual evidence. Extensive experiments on challenging benchmarks, including UCF-Crime and XD-Violence, show that LATERN enhances detection accuracy and explanation consistency for frozen VLMs during test time, while generating temporally coherent and semantically grounded event-level explanations.

[CV-21] EverAnimate: Minute-Scale Human Animation via Latent Flow Restoration

链接: https://arxiv.org/abs/2605.15042
作者: Wuyang Li,Yang Gao,Mariam Hassan,Lan Feng,Wentao Pan,Po-Chien Luan,Alexandre Alahi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project Page: this https URL

点击查看摘要

Abstract:We propose EverAnimate, an efficient post-training method for long-horizon animated video generation that preserves visual quality and character identity. Long-form animation remains challenging because highly dynamic human motion must be synthesized against relatively static environments, making chunk-based generation prone to accumulated drift: (i) low-level quality drift, such as progressive degradation of static backgrounds, and (ii) high-level semantic drift, such as inconsistent character identity and view-dependent attributes. To address this issue, EverAnimate restores drifted flow trajectories by anchoring generation to a persistent latent context memory, consisting of two complementary mechanisms. (i) Persistent Latent Propagation maintains a context memory across chunks to propagate identity and motion in latent space while mitigating temporal forgetting. (ii) Restorative Flow Matching introduces an implicit restoration objective during sampling through velocity adjustment, improving within-chunk fidelity. With only lightweight LoRA tuning, EverAnimate outperforms state-of-the-art long-animation methods in both short- and long-horizon settings: at 10 seconds, it improves PSNR/SSIM by 8%/7% and reduces LPIPS/FID by 22%/11%; at 90 seconds, the gains increase to 15%/15% and 32%/27%, respectively.

[CV-22] HiSem: Hierarchical Semantic Disentangling for Remote Sensing Image Change Captioning

链接: https://arxiv.org/abs/2605.15024
作者: Man Wang,Chenyang Liu,Wenjun Li,Feng Ni,Bing Jia,Baoqi Huang,Riting Xia,Zhenwei Shi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing image change captioning (RSICC) aims to achieve high-level semantic understanding of genuine changes occurring between bi-temporal images. Despite notable progress, existing methods are fundamentally limited by a shared modeling assumption: changed and unchanged image pairs, which have intrinsically different semantic granularities, are processed under a unified modeling strategy. This modeling inconsistency leads to semantic entanglement between coarse-grained change existence judgment and fine-grained semantic description. To address the above limitation, we propose a novel hierarchical semantic disentangling network (HiSem) that explicitly disentangles semantic representations of different granularities. Specifically, we first introduce the Bidirectional Differential Attention Modulation (BDAM) module that leverages discrepancy-aware attention to enhance cross-temporal interactions, thereby amplifying true change signals while suppressing irrelevant variations. Building upon this, we design a Hierarchical Adaptive Semantic Disentanglement (HASD) module that performs adaptive routing at two hierarchical levels: a coarse-grained image-level routing mechanism distinguishes changed and unchanged image pairs, while a fine-grained token-level Mixture-of-Experts (MoE) block models diverse and heterogeneous change semantics for changed samples. Extensive experiments on two benchmark datasets demonstrate that HiSem outperforms previous methods, achieving a significant improvement of +7.52% BLEU-4 on the WHU-CDC dataset. More importantly, our approach provides a structured perspective for RSICC by explicitly aligning model design with the intrinsic semantic heterogeneity of bi-temporal scenes. The code will be available at this https URL

[CV-23] 3D Skew-Normal Splatting

链接: https://arxiv.org/abs/2605.15010
作者: Xiangru Wu,Ke Fan,Yanwei Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions, yet they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moreover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization pipelines. Furthermore, to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.
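
The Azzalini skew-normal kernel named in this abstract has the closed-form density f(x; α) = 2φ(x)Φ(αx), where φ and Φ are the standard normal pdf and cdf. A minimal Python sketch (our own illustration of the distribution itself, not the authors' splatting code) shows how the skewness parameter interpolates between a symmetric Gaussian (α = 0) and a half-Gaussian-like shape (large |α|):

```python
import math

def skew_normal_pdf(x, alpha):
    """Azzalini skew-normal density: 2 * phi(x) * Phi(alpha * x)."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(alpha * x / math.sqrt(2.0)))  # standard normal cdf
    return 2.0 * phi * Phi

# alpha = 0 recovers the symmetric standard normal density
sym = skew_normal_pdf(0.0, 0.0)        # equals 1/sqrt(2*pi)
# large |alpha| approaches a half-normal: one side suppressed, other doubled
left = skew_normal_pdf(-1.0, 10.0)     # near 0 (suppressed side)
right = skew_normal_pdf(1.0, 10.0)     # near 2 * phi(1) (doubled side)
```

At α = 0 the Φ factor is constantly 1/2 and the density reduces exactly to the standard normal; as α grows, probability mass on one side vanishes while the other side tends to twice the normal density.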

[CV-24] Predicting Response to Neoadjuvant Chemotherapy in Ovarian Cancer from CT Baseline Using Multi-Loss Deep Learning

链接: https://arxiv.org/abs/2605.14991
作者: Francesco Pastori,Francesca Fati,Marina Rosanu,Luigi De Vitis,Lucia Ribero,Gabriella Schivardi,Giovanni Damiano Aletti,Nicoletta Colombo,Jvan Casarin,Francesco Multinu,Elena De Momi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Ovarian cancer is the most lethal gynecologic malignancy: around 60% of patients are diagnosed at an advanced stage, with an associated 5-year survival rate of about 30%. Early identification of non-responders to neoadjuvant chemotherapy remains a key unmet need, as it could prevent ineffective therapy and avoid delays in optimal surgical management. This work proposes a non-invasive deep learning framework to predict neoadjuvant chemotherapy response from pre-treatment contrast-enhanced CT by leveraging automatically derived 3D lesion masks. The approach encodes axial slices with a partially fine-tuned pretrained image encoder and aggregates slice-level representations into a volumetric embedding through an attention-based module. Training combines classification loss with supervised contrastive regularization and hard-negative mining to improve separation between ambiguous responders and non-responders. The method was developed on a retrospective single-center cohort from the European Institute of Oncology (Milan, IT), including 280 eligible patients (147 responder, 133 non-responder). On the test cohort, the model achieved a ROC-AUC of 0.73 (95% CI: 0.58-0.86) and an F1-score of 0.70 (95% CI: 0.56-0.82). Overall, these results suggest that the proposed architecture learns clinically relevant predictive patterns and provides a robust foundation for an imaging-based stratification tool.
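
The slice-to-volume aggregation described above is commonly implemented as attention pooling over slice embeddings. A generic NumPy sketch under that reading (the tanh scoring head and all shapes below are our assumptions, not details taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(slice_feats, W, v):
    """Aggregate per-slice features (S, D) into one volume embedding (D,).

    Scores each slice with a small tanh head, normalizes the scores with
    softmax, and returns the attention-weighted sum of slice features.
    """
    scores = np.tanh(slice_feats @ W) @ v              # (S,)
    scores = scores - scores.max()                     # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()    # softmax over slices
    return weights @ slice_feats                       # (D,)

S, D, H = 12, 32, 16                 # slices, feature dim, hidden dim (toy sizes)
feats = rng.normal(size=(S, D))
W = rng.normal(size=(D, H)) * 0.1
v = rng.normal(size=(H,))

vol = attention_pool(feats, W, v)
assert vol.shape == (D,)
```

The attention weights also make the model partially interpretable: slices with high weight are the ones driving the volumetric prediction.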

[CV-25] Characterizing the visual representation of objects from the child's view

链接: https://arxiv.org/abs/2605.14990
作者: Jane Yang,Tarun Sepuri,Alvin Wei Ming Tan,Khai Loong Aw,Michael C. Frank,Bria Long
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 6 figures

点击查看摘要

Abstract:Children acquire object category representations from their everyday experiences in the first few years of life. What do the inputs to this learning process look like? We analyzed first-person videos of young children’s visual experience at home from the BabyView dataset (N = 31 participants, 868 hours, ages 5–36 months), using a supervised object detection model to extract common object categories from more than 3 million frames. We found that children’s object category exposure was highly skewed: a few categories (e.g., cups, chairs) dominated children’s visual experiences while most categories appeared rarely, replicating previous findings from a more restricted set of contexts. Category exemplars were highly variable: children encountered objects from unusual angles, in highly cluttered scenes, and partially occluded views; many categories (especially animals) were most frequently viewed as depictions. Surprisingly, despite this variability, detected categories (e.g., giraffes, apples) showed stronger groupings within superordinate categories (e.g., animals, food) relative to groupings derived from canonical photographs of these categories. We found this same pattern when using high-dimensional embeddings from both self-supervised visual and multimodal models; this effect was also recapitulated in densely sampled data from individual children. Understanding the robustness and efficiency of visual category learning will require the development of models that can exploit strong superordinate structure and learn from non-canonical, sparse, and variable exemplars.

[CV-26] Compositional Video Generation via Inference-Time Guidance

链接: https://arxiv.org/abs/2605.14988
作者: Ariel Shaulov,Eitan Shaar,Amit Edenzon,Gal Chechik,Lior Wolf
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model’s own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
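
The steering mechanism above, classifier gradients nudging the latent during early denoising, follows the general pattern of classifier guidance. A toy sketch with quadratic stand-ins for the denoiser and the compositional classifier (entirely hypothetical; real systems obtain both gradients by backpropagating through networks):

```python
import numpy as np

def guided_step(z, denoise_grad, classifier_grad, guidance_scale=2.0, step=0.1):
    """One steered denoising update: follow the model's denoising direction,
    plus a scaled gradient pushing the latent toward the desired composition."""
    return z - step * (denoise_grad + guidance_scale * classifier_grad)

# Toy quadratic stand-ins: the "denoiser" pulls z toward 0,
# the "classifier" pulls z toward a target composition.
target = np.array([1.0, -1.0])
z = np.array([3.0, 3.0])
for _ in range(200):
    z = guided_step(z, denoise_grad=z, classifier_grad=(z - target))
# z converges to a compromise between the denoiser optimum and the target
```

With these stand-ins each update is z ← 0.7·z + 0.2·target, so z converges to (2/3)·target; raising `guidance_scale` moves the fixed point closer to the classifier's target, which is the trade-off guidance scales control in practice.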

[CV-27] Sat3DGen: Comprehensive Street-Level 3D Scene Generation from Single Satellite Image ICLR2026

链接: https://arxiv.org/abs/2605.14984
作者: Ming Qian,Zimin Xia,Changkun Liu,Shuailei Ma,Wen Wang,Zeran Ke,Bin Tan,Hang Zhang,Gui-Song Xia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICLR 2026; code: this https URL; demo: this https URL; project page: this https URL

点击查看摘要

Abstract:Generating a street-level 3D scene from a single satellite image is a crucial yet challenging task. Current methods present a stark trade-off: geometry-colorization models achieve high geometric fidelity but are typically building-focused and lack semantic diversity. In contrast, proxy-based models use feed-forward image-to-3D frameworks to generate holistic scenes by jointly learning geometry and texture, a process that yields rich content but coarse and unstable geometry. We attribute these geometric failures to the extreme viewpoint gap and sparse, inconsistent supervision inherent in satellite-to-street data. We introduce Sat3DGen to address these fundamental challenges, which embodies a geometry-first methodology. This methodology enhances the feed-forward paradigm by integrating novel geometric constraints with a perspective-view training strategy, explicitly countering the primary sources of geometric error. This geometry-centric strategy yields a dramatic leap in both 3D accuracy and photorealism. For validation, we first constructed a new benchmark by pairing the VIGOR-OOD test set with high-resolution DSM data. On this benchmark, our method improves geometric RMSE from 6.76m to 5.20m. Crucially, this geometric leap also boosts photorealism, reducing the Fréchet Inception Distance (FID) from ~40 to 19 against the leading method, Sat2Density++, despite using no extra tailored image-quality modules. We demonstrate the versatility of our high-quality 3D assets through diverse downstream applications, including semantic-map-to-3D synthesis, multi-camera video generation, large-scale meshing, and unsupervised single-image Digital Surface Model (DSM) estimation. The code has been released on this https URL.

[CV-28] MicroscopyMatching: Towards a Ready-to-use Framework for Microscopy Image Analysis in Diverse Conditions

链接: https://arxiv.org/abs/2605.14980
作者: Xiaofei Hui,Haoxuan Qu,Hossein Rahmani,Shuohong Wang,Jeff W. Lichtman,Jun Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Analyzing microscopy images to extract biological object properties (e.g., their morphological organization, temporal dynamics, and population density) is fundamental to various biomedical research. Yet conducting this manually is costly and time-consuming. Though deep learning-based approaches have been explored to automate this process, the substantial diversity of microscopy analysis settings in practice (including variations of biological object types, sample processing protocols, imaging equipment, and analysis tasks, etc.) often renders them ineffective. As a result, these approaches typically require extensive adaptation for different settings, which, however, can impose burdens that are often practically unsustainable for laboratories, forcing biomedical researchers to still commonly rely on manual analysis, thereby severely bottlenecking the pace of biomedical research progress. This situation has created a pressing and long-standing need for a reliable and broadly applicable microscopy image analysis tool, yet such a tool is still missing. To address this gap, we present the first ready-to-use microscopy image analysis framework, MicroscopyMatching, that can reliably perform key analysis tasks (including segmentation, tracking, and counting) across diverse microscopy analysis settings. From a fundamentally different perspective, MicroscopyMatching reformulates diverse microscopy image analysis tasks as a unified matching problem, effectively handling this problem by exploiting the robust matching capability from pre-trained latent diffusion models.

[CV-29] MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs

链接: https://arxiv.org/abs/2605.14966
作者: Wei Ding,Yilin Li,Yudong Zhang,Ruobing Xie,Xingwu Sun,Jiansheng Chen,Yu Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 19 pages, 17 figures

点击查看摘要

Abstract:Large vision-language models (LVLMs) have achieved remarkable performance across diverse multimodal tasks, yet they continue to suffer from hallucinations, generating content that is inconsistent with the visual input. Prior work DHCP (Detecting Hallucinations by Cross-modal Attention Pattern) has explored hallucination detection from the perspective of cross-modal attention, but does not address hallucination mitigation. In this paper, we propose MHSA (Mitigating Hallucinations via Steered Attention), a lightweight framework that mitigates hallucinations by learning to correct cross-modal attention patterns in LVLMs. MHSA trains a simple three-layer MLP generator to produce corrected attention, guided by supervisory signals from the DHCP discriminator and the LVLM itself. During inference, MHSA mitigates both discriminative and generative hallucinations across various datasets and LVLMs by simply replacing the original cross-modal attention with the corrected one, without modifying any LVLM parameters. By extending cross-modal attention mechanisms from hallucination detection to hallucination mitigation, MHSA offers a novel perspective on hallucination research in LVLMs and helps enhance their reliability.

[CV-30] H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

链接: https://arxiv.org/abs/2605.14963
作者: Chenxing Jiang,Zhe Tong,Pusen Gao,Peize Liu,Yang Xu,Chuan Fang,Ping Tan,Shaojie Shen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 9 figures

点击查看摘要

Abstract:Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortion. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct a high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator, specifically operating in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatch. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. Both the model and the dataset will be open-sourced.

[CV-31] Meschers: Geometry Processing of Impossible Objects

链接: https://arxiv.org/abs/2605.14960
作者: Ana Dodik,Isabella Yu,Kartik Chandra,Jonathan Ragan-Kelley,Joshua Tenenbaum,Vincent Sitzmann,Justin Solomon
类目: Graphics (cs.GR); Computational Geometry (cs.CG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Impossible objects, geometric constructions that humans can perceive but that cannot exist in real life, have been a topic of intrigue in visual arts, perception, and graphics, yet no satisfying computer representation of such objects exists. Previous work embeds impossible objects in 3D, cutting them or twisting/bending them in the depth axis. Cutting an impossible object changes its local geometry at the cut, which can hamper downstream graphics applications, such as smoothing, while bending makes it difficult to relight the object. Both of these can invalidate geometry operations, such as distance computation. As an alternative, we introduce Meschers, meshes capable of representing impossible constructions akin to those found in M.C. Escher’s woodcuts. Our representation has a theoretical foundation in discrete exterior calculus and supports the use-cases above, as we demonstrate in a number of example applications. Moreover, because we can do discrete geometry processing on our representation, we can inverse-render impossible objects. We also compare our representation to cut and bend representations of impossible objects.

[CV-32] Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model

链接: https://arxiv.org/abs/2605.14950
作者: Tao Lin,Yuxin Du,Jiting Liu,Nuobei Zhu,Yunhe Li,Yuqian Fu,Yinxinyu Chen,Hongyi Cai,Zewei Ye,Bing Cheng,Kai Ye,Yiran Mao,Yilei Zhong,MingKang Dong,Junchi Yan,Gen Li,Bo Zhao
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a lightweight depth-enhanced VLA framework that enhances spatially grounded manipulation without relying on additional sensing hardware or compromising deployment efficiency. Evo-Depth employs a lightweight Implicit Depth Encoding Module to extract compact depth features from multi-view RGB images. These features are incorporated into vision-language representations through a Spatial Enhancement Module via depth-aware modulation, enabling efficient spatial-semantic enhancement. A Progressive Alignment Training strategy is further introduced to align the resulting depth-enhanced representations with downstream action learning. With only 0.9B parameters, Evo-Depth achieves superior performance across four simulation benchmarks. In real-world experiments, Evo-Depth attains the highest average success rate while also exhibiting the smallest model size, lowest GPU memory usage, and highest inference frequency among compared methods.
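
The "depth-aware modulation" in the Spatial Enhancement Module reads like feature-wise (FiLM-style) conditioning, where depth features predict a per-channel scale and shift applied to the vision-language features. A generic sketch under that assumption (not the authors' exact module; all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def depth_modulate(vl_feat, depth_feat, Wg, Wb):
    """FiLM-style modulation: depth features predict per-channel scale/shift.

    Using (1 + gamma) keeps the map at identity when depth features are zero,
    so the modulation starts as a no-op and learns deviations from it.
    """
    gamma = depth_feat @ Wg        # (D,) multiplicative scale
    beta = depth_feat @ Wb         # (D,) additive shift
    return (1.0 + gamma) * vl_feat + beta

D, Dd = 16, 8                      # toy vision-language and depth feature dims
vl = rng.normal(size=D)
depth = rng.normal(size=Dd)
Wg = rng.normal(size=(Dd, D)) * 0.05
Wb = rng.normal(size=(Dd, D)) * 0.05

out = depth_modulate(vl, depth, Wg, Wb)
# with zero depth features the modulation is exactly the identity
assert np.allclose(depth_modulate(vl, np.zeros(Dd), Wg, Wb), vl)
```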

[CV-33] A CUBS-Compatible Ultrasound Morphology and Uncertainty-Aware Baseline for Carotid Intima-Media Segmentation and Preliminary Risk Prediction

链接: https://arxiv.org/abs/2605.14949
作者: Aueaphum Aueawatthanaphisut
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注: 13 pages, 5 figures, 2 tables, 20 equations, 3 appendices

点击查看摘要

Abstract:Carotid atherosclerosis is a major contributor to ischemic stroke and transient ischemic attack. Conventional ultrasound assessment is commonly based on intima-media thickness, plaque appearance, stenosis degree, and peak systolic velocity, but these morphology- and velocity-based indicators may not fully capture patient-specific vascular risk. This study presents AtheroFlow-XNet, a CUBS-compatible ultrasound morphology and uncertainty-aware learning baseline for carotid intima-media segmentation and preliminary risk prediction. Using the Carotid Ultrasound Boundary Study dataset, manual lumen-intima and media-adventitia boundary annotations were converted into dense intima-media masks for supervised segmentation. Clinical variables were incorporated into an auxiliary risk-prediction branch, and Monte Carlo dropout was used for uncertainty-aware inference. The model was evaluated using a patient-level train-validation-test split with 1,522 training images, 326 validation images, and 328 testing images. The proposed model achieved a Dice coefficient of 0.7930 for LI-MA mask segmentation, a segmentation loss of 0.2359, and an area under the receiver operating characteristic curve of 0.6910 for preliminary risk prediction. Qualitative results showed that predicted masks were generally aligned with manual annotations, while uncertainty maps highlighted ambiguous wall-boundary regions. These results suggest that ultrasound-derived carotid morphology can support automated wall analysis and uncertainty-aware interpretation. Since CUBS does not provide Doppler waveforms or CFD-derived hemodynamic biomarkers, this work should be interpreted as a reproducible morphology-driven baseline. Future work will incorporate Doppler-derived flow profiles, patient-specific vascular reconstruction, and CFD-based wall shear biomarkers.
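
Monte Carlo dropout, used here for uncertainty-aware inference, keeps dropout active at test time and aggregates several stochastic forward passes; the spread across passes serves as the uncertainty estimate that highlights ambiguous regions. A toy NumPy sketch with a linear stand-in model (our illustration only, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(42)

def mc_dropout_predict(x, weights, p_drop=0.5, T=200):
    """Run T stochastic passes with dropout active; return (mean, std)."""
    preds = []
    for _ in range(T):
        mask = rng.random(weights.shape) >= p_drop   # Bernoulli keep-mask
        w = weights * mask / (1.0 - p_drop)          # inverted dropout scaling
        preds.append(float(x @ w))
    preds = np.array(preds)
    return preds.mean(), preds.std()

x = np.array([0.5, -1.0, 2.0])
weights = np.array([1.0, 0.5, 0.25])
mean, std = mc_dropout_predict(x, weights)
# mean stays near the deterministic output x @ weights = 0.5,
# while std quantifies the model's predictive uncertainty
```

In the segmentation setting the same recipe is applied per pixel, producing the uncertainty maps that flag ambiguous wall-boundary regions.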

[CV-34] ACE-LoRA: Adaptive Orthogonal Decoupling for Continual Image Editing

链接: https://arxiv.org/abs/2605.14948
作者: Yuehao Liu,Weijia Zhang,Xuanming Shang,Zhizhou Chen,Yanhao Ge,Shanyan Guan,Chao Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:State-of-the-art diffusion models often rely on parameter-efficient fine-tuning to perform specialized image editing tasks. However, real-world applications require continual adaptation to new tasks while preserving previously learned knowledge. Despite the practical necessity, continual learning for image editing remains largely underexplored. We propose ACE-LoRA, a dynamic regularization framework for continual image editing that effectively mitigates catastrophic forgetting. ACE-LoRA leverages Adaptive Orthogonal Decoupling to identify and orthogonalize task interference, and introduces a Rank-Invariant Historical Information Compression strategy to address scalability issues in continual updates. To facilitate continual learning in image editing and provide a standardized evaluation protocol, we introduce CIE-Bench, the first comprehensive benchmark in this domain. CIE-Bench encompasses diverse and practically relevant image editing scenarios with a balanced level of difficulty to effectively expose limitations of existing models while remaining compatible with parameter-efficient fine-tuning. Extensive experiments demonstrate that our method consistently outperforms existing baselines in terms of instruction fidelity, visual realism, and robustness to forgetting, establishing a strong foundation for continual learning in image editing.

[CV-35] Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models

链接: https://arxiv.org/abs/2605.14938
作者: Yuehao Liu,Shanyan Guan,Weijia Zhang,Xuanming Shang,Yanhao Ge,Wei Li,Chao Ma
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Continual learning in multimodal large language models (MLLMs) aims to sequentially acquire knowledge while mitigating catastrophic forgetting, yet existing methods face inherent limitations: architecture-based approaches incur additional computational overhead and often generalize poorly to new tasks, rehearsal-based methods rely on storing historical data, raising privacy and storage concerns, and conventional regularization-based strategies alone are insufficient to fully prevent parameter interference. We propose Octopus, a two-stage continual learning framework based on History-Free Gradient Orthogonalization (HiFGO), which enforces gradient-level orthogonality without historical task data. Our proposed two-stage finetuning strategy decouples task adaptation from regularization, achieving a principled balance between plasticity and stability. Experiments on UCIT show that Octopus establishes state-of-the-art performance, surpassing prior SOTA by 2.14% and 6.82% in terms of Avg and Last.
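
Gradient orthogonalization in continual learning generally means projecting the new task's gradient onto the complement of directions that must be protected, so updates cannot move along them. A minimal sketch of that projection (a generic illustration; the paper's history-free variant derives its protected directions without stored task data and is not reproduced here):

```python
import numpy as np

def orthogonalize(grad, basis):
    """Project `grad` onto the orthogonal complement of span(basis).

    basis: (k, d) array of orthonormal directions to protect.
    Returns a gradient with zero component along every basis vector,
    so the parameter update cannot disturb the protected subspace.
    """
    for u in basis:
        grad = grad - np.dot(grad, u) * u
    return grad

# One protected direction along e1; the update then only moves along e2.
basis = np.array([[1.0, 0.0]])
g = np.array([3.0, 4.0])
g_perp = orthogonalize(g, basis)   # -> [0., 4.]
```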

[CV-36] Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

链接: https://arxiv.org/abs/2605.14935
作者: Nhat Le,Daochang Liu,Anh Nguyen,Ajmal Mian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with the control objectives. MSCoT is able to produce quality motions, consistent with the control constraints, while offering substantially faster sampling than diffusion-based approaches. Experiments on popular benchmarks demonstrate state-of-the-art controllable text-to-motion generation performance of MSCoT over existing baselines, with better motion quality (48% FID improvement), higher control accuracy (-61% avg error), and 10× faster inference speed on HumanML3D.

[CV-37] SCRWKV: Ultra-Compact Structure-Calibrated Vision-RWKV for Topological Crack Segmentation ICML2026

链接: https://arxiv.org/abs/2605.14926
作者: Hanxu Zhang,Chen Jia,Hui Liu,Xu Cheng,Fan Shi,Shengyong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Achieving pixel-level accurate segmentation of structural cracks across diverse scenarios remains a formidable challenge. Existing methods face significant bottlenecks in balancing crack topology modeling with computational efficiency, often failing to reconcile high segmentation quality with low resource demands. To address these limitations, we propose the Ultra-Compact Structure-Calibrated Vision RWKV (SCRWKV), a network that achieves high-precision modeling via a novel Structure-Field Encoder (SFE) backbone while maintaining linear complexity. The SFE integrates the Adaptive Multi-scale Cascaded Modulator (AMCM) to enhance texture representation and utilizes the Structure-Calibrated Insight Unit (SCIU) as its core engine. Specifically, the SCIU employs the Geometry-guided Bidirectional Structure Transformation (GBST) to capture topological correlations and integrates the Dynamic Self-Calibrating Decay (DSCD) into Dy-WKV to suppress noise propagation. Furthermore, we introduce a lightweight Cross-Scale Harmonic Fusion (CSHF) decoder to achieve precise feature aggregation. Systematic evaluations on multiple benchmarks characterized by complex textures and severe interference demonstrate that SCRWKV, with only 1.22M parameters, significantly outperforms SOTA methods. Achieving an F1 score of 0.8428 and mIoU of 0.8512 on the TUT dataset, the model confirms its robust potential for efficient real-world deployment. The code is available at this https URL.
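
The Dy-WKV mentioned above builds on the RWKV-style WKV recurrence, a linear-complexity decayed weighted average over past values. A simplified NumPy sketch (omitting RWKV's per-head bonus term; the paper's Dynamic Self-Calibrating Decay is approximated here simply by letting the decay vary per step, which is our own toy reduction):

```python
import numpy as np

def wkv(keys, values, decay):
    """Linear-complexity WKV recurrence with a (possibly dynamic) decay.

    Maintains running numerator/denominator states so each step is O(1):
    out_t = num_t / den_t, with past contributions shrunk by exp(-decay_t).
    """
    num, den = 0.0, 0.0
    outs = []
    for k, v, w in zip(keys, values, decay):
        num = np.exp(-w) * num + np.exp(k) * v
        den = np.exp(-w) * den + np.exp(k)
        outs.append(num / den)
    return np.array(outs)

keys = np.array([0.0, 0.0, 0.0])
vals = np.array([1.0, 2.0, 3.0])
# large decay = forget the past quickly: output tracks the current value
fast = wkv(keys, vals, decay=np.full(3, 10.0))
# zero decay = running average of everything seen so far
slow = wkv(keys, vals, decay=np.zeros(3))
```

A dynamic, input-dependent decay lets the model hold context through homogeneous background while resetting quickly at crack boundaries, which is the intuition behind suppressing noise propagation.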

[CV-38] Road Maps as Free Geometric Priors: Weather-Invariant Drone Geo-Localization with GeoFuse

链接: https://arxiv.org/abs/2605.14925
作者: Yunsong Fang(1),Tingyu Wang(2),Zhedong Zheng(1) ((1) University of Macau, (2) Hangzhou Dianzi University)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 18 pages, 4 figures

点击查看摘要

Abstract:Drone-view geo-localization aims to match a query drone image, often captured under adverse weather conditions (e.g., rain, snow, fog), against a gallery of geo-tagged satellite images. Weather-induced degradations in the drone view, such as noise, reduced visibility, and partial occlusions, severely exacerbate the intrinsic cross-view domain gap. While prior methods predominantly rely on weather-specific architectures or data augmentations, they have largely overlooked road map data, a readily available modality that provides strong, inherently weather-invariant geometric layout cues (e.g., road networks and building footprints) at negligible additional cost. We introduce GeoFuse, a cross-modal fusion framework that integrates precisely aligned road map tiles with satellite imagery to yield more discriminative and weather-resilient representations. We first augment the existing University-1652 and DenseUAV benchmarks with geo-aligned road maps, supplying structural priors robust to meteorological variations. Building on this, we propose a flexible fusion module that combines satellite and road map features via token-level and channel-level interactions, with a lightweight dynamic gating mechanism that adaptively weights modality contributions per instance. Finally, we employ class-level cross-view contrastive learning to promote robust alignment between weather-degraded drone features and the fused satellite-roadmap representations. Extensive experiments under diverse weather conditions show that GeoFuse consistently outperforms state-of-the-art methods, achieving +3.46% and +23.18% Recall@1 accuracy on the University-1652 and DenseUAV benchmarks, respectively.
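
The lightweight dynamic gating that adaptively weights modality contributions per instance can be sketched as a sigmoid gate over the concatenated features, making each fused channel a convex combination of the two modalities. The single-layer gate and all shapes below are our assumptions, not the paper's exact design:

```python
import numpy as np

rng = np.random.default_rng(1)

def gated_fuse(sat_feat, map_feat, Wg, bg):
    """Fuse satellite and road-map features with a learned per-instance gate."""
    z = np.concatenate([sat_feat, map_feat])
    g = 1.0 / (1.0 + np.exp(-(Wg @ z + bg)))   # gate in (0, 1), per channel
    return g * sat_feat + (1.0 - g) * map_feat

D = 8                                # toy feature dimension
sat = rng.normal(size=D)             # satellite-image features
rmap = rng.normal(size=D)            # road-map features
Wg = rng.normal(size=(D, 2 * D)) * 0.1
bg = np.zeros(D)

fused = gated_fuse(sat, rmap, Wg, bg)
# each fused channel lies between the two modality values for that channel
```

Because the gate is computed from both inputs, a weather-degraded satellite view can shift weight toward the weather-invariant road-map cues on a per-instance basis.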

[CV-39] SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

链接: https://arxiv.org/abs/2605.14923
作者: Pengxin Xu,Xincheng Lin,Luping Xiao,Qing Jiang,Meishan Zhang,Hao Fei,Shanghang Zhang,Xingyu Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. Code, models, and dataset are provided in the manuscript

点击查看摘要

Abstract:General scene perception has progressed from object recognition toward open-vocabulary grounding, part localization, and affordance prediction. Yet these capabilities are often realized as isolated predictions that localize objects, parts, or interaction points without capturing the structured dependencies needed for interaction-oriented scene understanding. To address this gap, we introduce Hierarchical Scene Parsing, an interaction-oriented parsing task that represents physical scenes as explicit scene-object-part-affordance hierarchies with cross-level bindings. We instantiate this task with SceneParser, a VLM-based parser trained for unified hierarchical generation with structural-completion pseudo labels and curriculum learning. To support training and evaluation, we construct SceneParser-Bench, a large-scale benchmark built with a scalable hierarchical data engine, containing 110K training images, a 5K validation split, 777K objects, 1.14M parts, 1.74M affordance annotations, and 1.74M valid object-part-affordance chain instances. We further introduce Level-1 to Level-3 conditional metrics and ParseRate to evaluate localization, cross-level binding, and hierarchical completeness. Experiments show that existing MLLMs and perception-stitching pipelines struggle with hierarchical parsing on our SceneParser-Bench, while SceneParser achieves stronger structure-aware performance. Besides, ablations, evaluations on COCO and AGD20K, and a downstream planning probe demonstrate that our SceneParser is compatible with conventional tasks and provides an actionable representation for visual understanding.

[CV-40] Representative Attention For Vision Transformers

链接: https://arxiv.org/abs/2605.14913
作者: Yuntong Li,Hainuo Wang,Hengxing Liu,Mingjia Li,Xiaojie Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Linear attention has emerged as a promising direction for scaling Vision Transformers beyond the quadratic cost of dense self-attention. A prevalent strategy is to compress spatial tokens into a compact set of intermediate proxies that mediate global information exchange. However, existing methods typically derive these proxy tokens from predefined spatial layouts, causing token compression to remain anchored to image coordinates rather than the semantic organization of visual content. To overcome this limitation, we propose Representative Attention (RPAttention), a linear global attention mechanism that performs token compression directly in representation space. Instead of constructing intermediate tokens from fixed spatial partitions, it dynamically forms a compact set of learned representative tokens to enable semantically related regions to communicate regardless of their spatial distance, by following a lightweight Gather-Interact-Distribute paradigm. Spatial tokens are first softly gathered into representative tokens through competitive similarity-based routing. The representatives then perform global interaction within a compact latent space, before broadcasting the refined information back to all spatial tokens via query-driven cross-attention. By replacing coordinate-driven aggregation with representation-driven compression, RPAttention preserves global receptive fields while adaptively aligning token communication with the content structure of each image. This reduces the dominant token interaction complexity from quadratic to linear scaling with respect to the number of spatial tokens, while maintaining expressive global context modeling. Extensive experiments across diverse vision transformer backbones on image classification, object detection, and semantic segmentation demonstrate the effectiveness of our design.
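
The Gather-Interact-Distribute paradigm can be written as three matrix products against m representative tokens, so the dominant cost is O(n·m) rather than O(n²) in the number n of spatial tokens. A simplified NumPy sketch (the competitive routing and attention details here are our own toy versions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rp_attention(tokens, reps):
    """tokens: (n, d) spatial tokens; reps: (m, d) learned representatives."""
    # Gather: softly route tokens to representatives by similarity
    route = softmax(tokens @ reps.T, axis=1)           # (n, m)
    gathered = route.T @ tokens                        # (m, d) soft aggregation
    # Interact: full self-attention among the m representatives only
    attn = softmax(gathered @ gathered.T / np.sqrt(tokens.shape[1]), axis=1)
    refined = attn @ gathered                          # (m, d)
    # Distribute: each spatial token queries the refined representatives
    back = softmax(tokens @ refined.T, axis=1)         # (n, m)
    return back @ refined                              # (n, d)

n, m, d = 64, 8, 16
out = rp_attention(rng.normal(size=(n, d)), rng.normal(size=(m, d)) * 0.1)
assert out.shape == (n, d)
```

Only the Interact step is quadratic, and it runs over m << n representatives, which is where the linear scaling in n comes from.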

[CV-41] SteerSeg: Attention Steering for Reasoning Video Segmentation

Link: https://arxiv.org/abs/2605.14908
Authors: Ali Cheraghian, Hamidreza Dastmalchi, Abdelwahed Khamis, Morteza Saberi, Aijun An, Lars Petersson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Video reasoning segmentation requires localizing objects across video frames from natural language expressions, often involving spatial reasoning and implicit references. Recent approaches leverage frozen large vision-language models (LVLMs) by extracting attention maps and using them as spatial priors for segmentation, enabling training-free grounding. However, these attention maps are optimized for text generation rather than spatial localization, often resulting in diffuse and ambiguous grounding signals. In this work, we introduce SteerSeg, a lightweight framework that identifies attention misalignment as the key bottleneck in attention-based grounding and proposes to steer attention at its source through input-level conditioning. SteerSeg combines learnable soft prompts with reasoning-guided Chain-of-Thought (CoT) prompting. The soft prompts reshape the attention distribution to produce more spatially concentrated maps, while CoT-derived attributes resolve ambiguity among similar objects by guiding attention toward the correct instance. The resulting attention maps are converted into point prompts across keyframes to guide a segmentation model, while candidate tracklets are ranked and selected using correlation-based scoring. Our approach freezes the LVLM and segmentation model parameters and learns only a small set of soft prompts, preserving the model’s pretrained reasoning capabilities while significantly improving grounding. Despite being trained only on Ref-YouTube-VOS, SteerSeg generalizes well across diverse benchmarks, significantly improving the spatial grounding capability of LVLMs. Project page: this https URL

[CV-42] MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Link: https://arxiv.org/abs/2605.14906
Authors: Xiyu Ren, Zhaowei Wang, Yiming Du, Zhongwei Xie, Chi Liu, Xinlin Yang, Haoyue Feng, Wenjun Pan, Tianshi Zheng, Baixuan Xu, Zhengnan Li, Yangqiu Song, Ginny Wong, Simon See
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Work in progress

Abstract:Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, with two method directions providing this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark conducts a systematic comparison of the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations, comprising 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K-256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at this https URL.

[CV-43] SEDiT: Mask-Free Video Subtitle Erasure via One-step Diffusion Transformer

Link: https://arxiv.org/abs/2605.14894
Authors: Zheng Hui, Yunlong Bai
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this http URL

Abstract:Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often rely on inpainting video frames based on masked input, which requires extracting the target video mask in advance, and the precision of the segmentation directly affects the quality of the completion. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that enables direct erasure of the targeted subtitle. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Since subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making it well-suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification. Under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy by occasionally conditioning the model with a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed targets, particularly in scenarios involving substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p video with infinite length.

[CV-44] Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers

Link: https://arxiv.org/abs/2605.14893
Authors: Jakub Grzywaczewski, Dawid Płudowski, Przemysław Biecek
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Contrastively pre-trained Vision-Language Models (VLMs) serve as powerful feature extractors. Yet, their shared latent spaces are prone to structural anomalies and act as repositories for non-semantic, multi-modal noise. To address this phenomenon, we employ spectral decomposition of covariance matrices to decompose the VLM latent space into a multi-modal semantic signal component and a shared noise subspace. We observe that this noise geometry exhibits strong subgroup invariance across distinct data subsets. Crucially, pruning these shared noise dimensions is mainly harmless, preserving or actively improving downstream task performance. By isolating true semantic signals from artifactual noise, this work provides new mechanistic insights into the representational structure of modern VLMs, suggesting that a substantial fraction of their latent geometry is governed by shared, architecture-level noise rather than task-relevant semantics alone.
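
The spectral analysis above admits a compact sketch: eigendecompose the embedding covariance and project out a chosen noise subspace. Which end of the spectrum holds the shared noise (the code hypothetically takes the top-variance directions) and how many dimensions to drop are assumptions for illustration, not the paper's findings:

```python
import numpy as np

def prune_noise_subspace(emb, n_noise):
    """Project embeddings onto the complement of an assumed noise subspace.

    emb: (n_samples, d) embedding matrix; returns the same shape with
    n_noise chosen eigendirections of the covariance removed.
    """
    centered = emb - emb.mean(axis=0)
    cov = np.cov(centered, rowvar=False)      # (d, d) covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    noise_dirs = eigvecs[:, -n_noise:]        # assumption: top-variance dims
    # Subtract each embedding's component inside the noise subspace.
    return emb - (emb @ noise_dirs) @ noise_dirs.T
```

After pruning, every embedding lies in a (d − n_noise)-dimensional subspace, which is the sense in which the removed directions are "harmless" if they carry no task-relevant signal.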

[CV-45] Hierarchical Image Tokenization for Multi-Scale Image Super Resolution ICML2026

Link: https://arxiv.org/abs/2605.14891
Authors: Isma Hadji, Enrique Sanchez, Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ICML 2026. *Joint first authorship (alphabetical order). arXiv admin note: substantial text overlap with arXiv:2506.04990

Abstract:We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR,HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.
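
The additive, coarse-to-fine tokenization that HIT builds on can be illustrated without the quantizer: each scale encodes the residual left by the coarser reconstructions, so the upsampled levels sum back to the input. Block-average downsampling and the [1, 2, 4] scale list below are illustrative choices, not the paper's tokenizer:

```python
import numpy as np

def multiscale_residuals(img, scales):
    """Additive multi-scale decomposition (RQ-style, minus quantization).

    img: (H, W) array; scales: grid sizes dividing H and W, coarse to fine.
    Returns the per-scale maps and the running reconstruction.
    """
    recon = np.zeros_like(img)
    levels = []
    for s in scales:
        r = img - recon                                    # current residual
        # Block-average the residual down to an (s, s) grid ...
        h = r.reshape(s, img.shape[0] // s, s, img.shape[1] // s).mean(axis=(1, 3))
        # ... then upsample by repeating each cell over its block.
        up = np.kron(h, np.ones((img.shape[0] // s, img.shape[1] // s)))
        levels.append(h)
        recon = recon + up                                 # additive refinement
    return levels, recon
```

When the finest scale equals the image size, the reconstruction is exact, which is the property that lets intermediate scales serve as valid lower-resolution outputs.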

[CV-46] SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

Link: https://arxiv.org/abs/2605.14889
Authors: Sukju Oh, Sukkyu Sun
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 28 pages, 7 figures, 10 tables; Code available at this https URL

Abstract:Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2’s structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path’s effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 119 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at this https URL.
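
The "state regramming" rotation can be illustrated generically: the Cayley transform turns any skew-symmetric matrix into an orthogonal one, so a learned rotation can mix state channels without changing norms. The parameterization below (free upper-triangular entries) is a hypothetical sketch, not the paper's code:

```python
import numpy as np

def cayley_rotation(params, d):
    """Orthogonal matrix from d*(d-1)/2 unconstrained parameters.

    Builds a skew-symmetric A (A.T == -A) from the parameters, then applies
    the Cayley transform R = (I - A)^{-1} (I + A), which is orthogonal for
    any skew-symmetric A.
    """
    A = np.zeros((d, d))
    A[np.triu_indices(d, k=1)] = params
    A = A - A.T                                  # enforce skew-symmetry
    I = np.eye(d)
    return np.linalg.solve(I - A, I + A)         # R, with R.T @ R == I
```

Because R is orthogonal, applying it to the SSM state opens cross-channel mixing while leaving state magnitudes untouched, which is what makes it safe to insert into an otherwise axis-aligned recurrence.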

[CV-47] Masked Next-Scale Prediction for Self-supervised Scene Text Recognition CVPR

Link: https://arxiv.org/abs/2605.14885
Authors: Zhuohao Chen, Zeng Li, Yifei Zhang, Chang Liu, Yu Zhou
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 Findings Track. 10 pages, 4 figures

Abstract:Scene Text Recognition requires modeling visual structures that evolve from coarse layouts to fine-grained character strokes. Training such models relies on large amounts of annotated data. Recent self-supervised approaches, such as Masked Image Modeling (MIM), alleviate this dependency by leveraging large-scale unlabeled data. Yet most existing MIM methods operate at a single spatial scale and fail to capture the hierarchical nature of scene text. In this work, we introduce Masked Next-Scale Prediction (MNSP), a unified self-supervised framework designed to explicitly model cross-scale structural evolution. The framework incorporates Next-Scale Prediction (NSP), which learns hierarchical representations by predicting higher-resolution features from lower-resolution contexts. Naive scale prediction, however, tends to produce spatially diffuse attention, directing the model toward background regions rather than textual structures. MNSP resolves this limitation by jointly learning cross-scale prediction and masked image reconstruction. NSP captures global layout priors across resolutions, while masked reconstruction imposes strong local constraints that guide attention toward informative text regions. A Multi-scale Linguistic Alignment module further maintains semantic consistency across different resolutions. Extensive experiments demonstrate that MNSP achieves state-of-the-art performance, reaching 86.2% average accuracy on the challenging Union14M benchmark and 96.7% across six standard datasets. Additional analyses show that our method improves robustness under extreme scale and layout variations. Code is available at this https URL

[CV-48] Denoising-GS: Gaussian Splatting with Spatial-aware Denoising

Link: https://arxiv.org/abs/2605.14880
Authors: Qingyuan Zhou, Xinyi Liu, Weidong Yang, Ning Wang, Shuquan Ye, Ben Fei, Ying He, Wanli Ouyang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have achieved remarkable success in high-fidelity Novel View Synthesis (NVS), yet the optimization process inevitably introduces noisy Gaussian primitives due to the sparse and incomplete initialization from Structure-from-Motion (SfM) point clouds. Most existing methods focus solely on adjusting the positions of primitives during optimization, while neglecting the underlying spatial structure. To this end, we introduce a new perspective by formulating the optimization of 3DGS as a primitive denoising process and propose Denoising-GS, a spatial-aware denoising framework for Gaussian primitives by taking both the positions and spatial structure into consideration. Specifically, we design an optimizer that preserves the spatial optimization flow of primitives, facilitating coherent and directed denoising rather than random perturbations. Building upon this, the Spatial Gradient-based Denoising strategy jointly considers the spatial supports of primitives to ensure gradient-consistent updates. Furthermore, the Uncertainty-based Denoising module estimates primitive-wise uncertainty to prune redundant or noisy primitives, while the Spatial Coherence Refinement strategy selectively splits primitives in sparse regions to maintain structural completeness. Experiments conducted on three benchmark datasets demonstrate that Denoising-GS consistently enhances NVS fidelity while maintaining representation compactness, achieving state-of-the-art performance across all benchmarks. Source code and models will be made publicly available.

[CV-49] HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

Link: https://arxiv.org/abs/2605.14877
Authors: Jonathan Cederlund, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages total including appendix; 6 main-paper figures, 2 appendix figures; 4 tables

Abstract:Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts cache allocation in each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves 2× higher compression ratio in memory allocation for KV cache compared to existing methods, while maintaining similar or better image fidelity, prompt alignment and human perception score. Our method achieves a new state-of-the-art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation.
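
A toy version of the head-tuned allocation makes the idea concrete: score each head by its attention mass on previously generated scales, then split a fixed cache budget accordingly. The proportional rule below is an assumption for illustration; the paper derives a static pruning schedule from an offline calibration set:

```python
import numpy as np

def head_budgets(attn_to_prior, total_budget):
    """Split a KV-cache budget across heads by prior-scale attention mass.

    attn_to_prior: (H,) nonnegative scores, one per attention head.
    Returns integer budgets summing exactly to total_budget.
    """
    w = attn_to_prior / attn_to_prior.sum()
    budgets = np.floor(w * total_budget).astype(int)
    budgets[np.argmax(w)] += total_budget - budgets.sum()  # absorb rounding
    return budgets
```

Heads that rarely look back at earlier scales receive small caches, which is where the memory savings come from.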

[CV-50] Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Link: https://arxiv.org/abs/2605.14876
Authors: Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose Δ-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

[CV-51] LPH-VTON: Resolving the Structure-Texture Dilemma of Virtual Try-On via Latent Process Handover

Link: https://arxiv.org/abs/2605.14874
Authors: Yixin Liu, Baihong Qian, Jinglin Jiang, Jeffery Wu, Yan Chen, Wei Wang, Yida Wang, Lanqing Yang, Guangtao Xue
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Virtual Try-On (VTON) aims to synthesize photorealistic images of garments precisely aligned with a person’s body and pose. Current diffusion-based methods, however, face a fundamental trade-off between structural integrity and textural fidelity. In this paper, we formalize this challenge as a consequence of complementary inductive biases inherent in prevailing architectures: models heavily reliant on spatial constraints naturally favor geometric alignment but often suppress textures, whereas models dominated by unconstrained generative priors excel at vibrant detail rendering but are prone to structural drift. Based on this diagnosis, we propose LPH-VTON, a new synergistic framework that resolves this tension within a single, continuous denoising process. LPH-VTON strategically decomposes the generation, leveraging a structure-biased model to establish a geometrically consistent latent scaffold in the early stages, before handing over control to a texture-biased model for high-fidelity detail rendering. Extensive experiments validate our approach. Our model achieves a superior Pareto-optimal balance, establishing new benchmarks in perceptual faithfulness while maintaining highly competitive structural alignment across the standard dataset VITON-HD, proving the efficacy of temporal architectural decoupling.
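
The handover itself is easy to picture as a single sampling loop that swaps denoisers partway through one continuous trajectory. The Euler-style update, the step-function signatures, and the 40% split point are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def handover_sampling(z, structure_step, texture_step, n_steps=50, split=0.4):
    """One continuous denoising trajectory with a mid-course model swap.

    Early steps use the structure-biased model to fix geometry; the
    texture-biased model then refines details on the same latent.
    """
    switch = int(n_steps * split)
    for i in range(n_steps):
        t = 1.0 - i / n_steps                     # noise level, 1 -> 0
        step = structure_step if i < switch else texture_step
        z = step(z, t)                            # one denoising update
    return z
```

Because both models act on the same latent state, the handover avoids the re-encoding losses a two-pipeline design would incur.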

[CV-52] FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

Link: https://arxiv.org/abs/2605.14854
Authors: Patrick Kwon, Chen Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

[CV-53] SR-Prominence: A Crowdsourced Protocol and Dataset Suite for Perceptually-Weighted Super-Resolution Artifact Evaluation

Link: https://arxiv.org/abs/2605.14847
Authors: Ivan Molodetskikh, Kirill Malyshev, Mark Mirgaleev, Nikita Zagainov, Evgeney Bogatyrev, Dmitriy Vatolin
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modern image super-resolution methods generate detailed, visually appealing results, but they often introduce visual artifacts: unnatural patterns and texture distortions that degrade perceived quality. These defects vary widely in perceptual impact–some are barely noticeable, while others are highly disturbing–yet existing detection methods treat them equally. We propose artifact prominence as an evaluative target, defined as the fraction of viewers who judge a highlighted region to contain a noticeable artifact. We design a crowdsourced annotation protocol and construct SR-Prominence, a dataset suite containing 3,935 artifact masks from DeSRA, Open Images, Urban100, and a realistic no-ground-truth Urban100-HR setting, annotated with prominence. Re-annotating DeSRA reveals that 48.2% of its in-lab binary artifacts are not noticed by a majority of viewers. Across the suite, we audit SR artifact detectors, image-quality metrics, and SR methods. We find that classical full-reference metrics, especially SSIM and DISTS, provide surprisingly strong localized prominence signals, whereas no-reference IQA methods and specialized artifact detectors often fail to generalize across datasets and reference settings. SR-Prominence is released with an objective scoring protocol that allows new metrics to be benchmarked on our suite without further crowdsourcing. Together, the data and protocols enable SR artifact evaluation to move from binary defect presence toward perceptual impact. SR-Prominence is available at this https URL.
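
Given per-viewer judgments for a highlighted region, the prominence target reduces to a vote fraction. The list-of-booleans format and the 0.5 majority threshold are assumptions about how the released annotations would be consumed, not the dataset's official API:

```python
def prominence(votes):
    """Fraction of viewers who judged the region to contain a noticeable
    artifact; votes is a list of per-viewer booleans for one mask."""
    return sum(votes) / len(votes)

def majority_noticed(votes):
    """True when a majority of viewers noticed the artifact, mirroring the
    criterion behind the 48.2% figure quoted in the abstract."""
    return prominence(votes) > 0.5
```

A metric evaluated against prominence is then scored on how well it predicts this continuous fraction, rather than on binary defect presence.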

[CV-54] Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study

Link: https://arxiv.org/abs/2605.14845
Authors: Marta Robledo-Moreno, Ruben Vera-Rodriguez, Ruben Tolosana, Javier Ortega-Garcia
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 14th International Workshop on Biometrics and Forensics

Abstract:Recent advancements in Vision-Language Models (VLMs) have demonstrated strong capabilities in general visual reasoning, yet their applicability to rigorous biometric tasks remains unexplored. This work presents an exploratory study evaluating the zero-shot performance of state-of-the-art VLMs (GPT-5.2 and Gemini 2.5 Pro) on the Signature Verification Challenge (SVC) benchmark. To enable visual processing, raw kinematic time-series are converted into static images, encoding pressure information into stroke opacity whenever available in the source data. Furthermore, we introduce a scoring protocol that extracts latent token probabilities to compute robust biometric scores. Experimental results reveal a significant performance dichotomy dependent on signal quality and forgery type. In random forgery scenarios, the zero-shot VLM achieves exceptional discrimination, with GPT-5.2 reaching an Equal Error Rate of 0.32% in mobile tasks, outperforming supervised state-of-the-art systems. Conversely, in skilled forgery scenarios, where the task is more challenging because both signatures are almost identical, the results are significantly worse, and a critical “Rationalization Trap” emerges: chain-of-thought (CoT) reasoning degrades performance as the model produces kinematic hallucinations to justify forgery artifacts as natural variability.
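
The scoring protocol, as described, turns latent token probabilities into a continuous score rather than a hard yes/no decision. Below is a hypothetical sketch assuming the model exposes log-probabilities for two answer tokens; the token names and dict format are invented for illustration:

```python
import math

def verification_score(logprobs):
    """Continuous match score from two decision-token log-probabilities.

    logprobs: dict with assumed keys "genuine" and "forgery" holding the
    model's log-probabilities for those answer tokens.
    Returns P(genuine) renormalized over the two options, in [0, 1].
    """
    p_gen = math.exp(logprobs["genuine"])
    p_forg = math.exp(logprobs["forgery"])
    return p_gen / (p_gen + p_forg)
```

A continuous score of this kind is what makes Equal Error Rate computation possible, since the decision threshold can be swept rather than fixed by the model's sampled answer.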

[CV-55] MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

Link: https://arxiv.org/abs/2605.14843
Authors: Rahul Jain, Mayank Patel, Asim Unmesh, Karthik Ramani
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Under Review

Abstract:Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.

[CV-56] Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis

Link: https://arxiv.org/abs/2605.14842
Authors: Mor Ventura, Roy Hirsch, Yonatan Bitton, Regev Cohen, Roi Reichart
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Humans naturally communicate through abstract concepts like “mood”. However, current image editing benchmarks focus primarily on explicit, literal commands, leaving abstract instructions largely underexplored. In this work, we first formalize the definition and taxonomy of abstract image editing. To measure instruction-following in this challenging domain, we introduce Entity-Rubrics, a framework that breaks down abstract edits into individual, entity-level assessments and achieves strong correlation with human judgment. Alongside this framework, we contribute AbstractEdit, the first benchmark dedicated to abstract image editing across diverse real-world scenes. Evaluating 11 leading models on this dataset reveals a fundamental challenge: standard architectures struggle to balance intent and preservation, commonly defaulting to under-editing or over-editing. Our analysis demonstrates that driving meaningful improvements relies heavily on integrating advanced LLM text encoders and iterative thinking. Looking forward, our entity-based paradigm can generalize beyond assessment to serve as a reward model, enable models to correctly interpret abstract communication, or highlight specific failures in test-time critique loops. Ultimately, we hope this work serves as a stepping stone toward seamless multimodal interaction, closing the gap between rigid machine execution and the natural, open-ended way humans communicate.

[CV-57] Multi-proposal Collaboration and Multi-task Training for Weakly-supervised Video Moment Retrieval

Link: https://arxiv.org/abs/2605.14838
Authors: Bolin Zhang, Chao Yang, Bin Jiang, Takahiro Komamizu, Ichiro Ide
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 26 pages, 4 figures. Preprint version of the article published in International Journal of Machine Learning and Cybernetics

Abstract:This study focuses on weakly-supervised Video Moment Retrieval (VMR), aiming to identify a moment semantically similar to the given query within an untrimmed video using only video-level correspondences, without relying on temporal annotations during training. Previous methods either aggregate predictions for all instances in the video, or indirectly address the task by proposing reconstructions for the query. However, these methods often produce low-quality temporal proposals, struggle with distinguishing misaligned moments in the same video, or lack stability due to a reliance on a single auxiliary task. To address these limitations, we present a novel weakly-supervised method called Multi-proposal Collaboration and Multi-task Training (MCMT). Initially, we generate multiple proposals and derive corresponding learnable Gaussian masks from them. These masks are then combined to create a high-quality positive sample mask, highlighting video clips most relevant to the query. Concurrently, we classify other clips in the same video as the easy negative sample and the entire video as the hard negative sample. During training, we introduce forward and inverse masked query reconstruction tasks to impose more substantial constraints on the network, promoting more robust and stable retrieval performance. Extensive experiments on two standard benchmarks affirm the effectiveness of the proposed method in VMR.
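
The learnable Gaussian masks above are simple to sketch: each proposal's (center, width) pair induces a soft weight over the video's clips, and combining the masks highlights query-relevant clips. Element-wise max as the fusion rule is an assumed choice for illustration:

```python
import numpy as np

def gaussian_masks(centers, widths, n_clips):
    """Soft temporal masks from proposal (center, width) pairs in [0, 1].

    Returns the per-proposal masks, shape (P, n_clips), and their
    element-wise maximum as a fused positive-sample mask.
    """
    t = np.linspace(0.0, 1.0, n_clips)                    # clip positions
    masks = np.stack([np.exp(-0.5 * ((t - c) / w) ** 2)
                      for c, w in zip(centers, widths)])
    return masks, masks.max(axis=0)
```

Because centers and widths are continuous, the masks can be optimized by gradient descent, which is what makes the proposals learnable without temporal annotations.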

[CV-58] Learning Direct Control Policies with Flow Matching for Autonomous Driving ITSC2026

Link: https://arxiv.org/abs/2605.14832
Authors: Marcello Ceresini, Federico Pirazzoli, Andrea Bertogalli, Lorenzo Cipelli, Filippo D'Addeo, Anthony Dell'Eva, Alessandro Paolo Capasso, Alberto Broggi
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 16 pages, 6 figures, 2 tables. Accepted at IEEE ITSC 2026

Abstract:We present a flow-matching planner for autonomous driving that directly outputs actionable control trajectories defined by acceleration and curvature profiles. The model is conditioned on a bird’s-eye-view (BEV) raster of the surrounding scene and generates control sequences in a small number of Ordinary Differential Equations (ODE) integration steps, enabling low-latency inference suitable for real-time closed-loop re-planning. We train exclusively on urban scenarios (real urban city streets, intersections and roundabouts of the city of Parma, Italy) collected from a 2D traffic simulator with reactive agents, and evaluate in closed-loop on both in-distribution and markedly out-of-distribution environments, including multi-lane highways and unseen urban scenarios. Our results show that the model generalizes reliably to these unseen conditions, maintaining stable closed-loop control and successfully completing scenarios that differ substantially from the training distribution. We attribute this to the BEV representation, which provides a geometry-centric view of the scene that is inherently less sensitive to distributional shifts, and to the flow-matching formulation, which learns a smooth vector field that degrades gracefully under distribution shift. We provide video demonstrations of closed-loop behavior at this https URL.

[CV-59] HDRFace: Rethinking Face Restoration with High-Dimensional Representation

链接: https://arxiv.org/abs/2605.14821
作者: Zirui Wang,Xianhui Lin,Yi Dong,Bo Wei,Gangjian Zhang,Siteng Ma,Zebiao Zheng,Xing Liu,Hong Gu,Minjing Dong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face restoration under complex degradations still remains an ill-posed inverse problem due to severe information loss. Although diffusion models benefit from strong generative priors, most methods still condition only on low-quality inputs, making it difficult to recover identity-critical details under heavy degradations. In this work, we propose HDRFace, a High-Dimensional Representation conditioned Face restoration framework that injects semantically rich priors into the conditional flow without modifying the generative backbone. Our pipeline first obtains a structurally reliable intermediate restoration with an off-the-shelf restorer, then uses a pretrained high-dimensional feature encoder to extract fine-grained facial representations from both the low-quality input and the intermediate result, and injects them as additional conditions for generation. We further introduce SDFM, a Structure-Detail aware adaptive Fusion Mechanism that emphasizes global constraints during structure modeling and strengthens representation guidance during detail synthesis, balancing structural consistency and detail fidelity. To validate the generalization ability of our method, we implement the proposed framework on two generative models, SD V2.1-base and Qwen-Image, and consistently observe stable and coherent performance gains across different architectures.

[CV-60] The Velocity Deficit: Initial Energy Injection for Flow Matching ICML2026

链接: https://arxiv.org/abs/2605.14819
作者: Linze Li,Zong-Wei Hong,Shen Zhang,Bo Lin,Jinglun Li,Yao Tang,Jiajun Liang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML2026

点击查看摘要

Abstract:While Flow Matching theoretically guarantees constant-velocity trajectories, we identify a critical breakdown in high-dimensional practice: the Velocity Deficit. We show that the MSE objective systematically underestimates velocity magnitude, causing generated samples to fail to reach the data manifold-a phenomenon we term Integration Lag. To rectify this, we propose Initial Energy Injection, instantiated via two complementary methods: the training-based Magnitude-Aware Flow Matching (MAFM) and the training-free Scale Schedule Corrector (SSC). Both are grounded in our discovery of a crucial asymmetry: velocity contraction causes harmful kinetic stagnation at the trajectory’s start, yet acts as a beneficial denoising mechanism at its end. Empirically, SSC yields significant efficiency gains with zero retraining and just one line of code. On ImageNet-1k (256x256), it improves FID by 44.6% (from 13.68 to 7.58) and achieves a 5x speedup, enabling a 50-step generator (FID 7.58) to beat a 250-step baseline (FID 8.65). Furthermore, our methods generalize to Text-to-Image tasks and high-resolution generation, improving FID on MS-COCO by ~22%.
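The training-free corrector can be pictured as one extra line in a standard Euler sampler: scale the predicted velocity up near t = 0 ("initial energy injection") and leave late steps untouched, where contraction acts as a denoiser. The boost value and warm-up schedule below are illustrative, not the paper's tuned schedule:

```python
import numpy as np

def sample_with_ssc(velocity_fn, x0, num_steps=50, boost=1.1, warmup_frac=0.3):
    """Euler sampler with a hypothetical Scale Schedule Corrector.

    Early velocities are scaled up to counter the Velocity Deficit; late
    steps are left alone, where velocity contraction is beneficial.
    Schedule values here are assumptions for illustration.
    """
    x, dt = x0.copy(), 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        scale = boost if t < warmup_frac else 1.0  # the "one line of code"
        x = x + dt * scale * velocity_fn(x, t)
    return x

# With a unit velocity field, the boosted trajectory travels past 1.0,
# compensating for a sampler that would otherwise fall short of the target.
boosted = sample_with_ssc(lambda x, t: np.ones_like(x), np.zeros(1))
```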

[CV-61] Probing into Camera Control of Video Models

链接: https://arxiv.org/abs/2605.14815
作者: Chen Hou,Christian Rupprecht
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.
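The "displacement fields plus resampling" idea reduces to warping latent features by a per-pixel flow. The nearest-neighbor gather below is a simplified, non-differentiable stand-in for the paper's differentiable resampling, using a single-channel latent for brevity:

```python
import numpy as np

def warp_latent(latent: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp a (H, W) latent map by a (H, W, 2) displacement field (dy, dx)
    induced by the desired camera motion.

    Nearest-neighbor gather for simplicity; the paper's resampling is
    differentiable (e.g. bilinear) so it can be applied during denoising.
    """
    H, W = latent.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return latent[src_y, src_x]

# A uniform downward displacement shifts the latent content up by one row,
# mimicking the frame-to-frame shift a camera tilt would induce.
latent = np.arange(16, dtype=float).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
warped = warp_latent(latent, flow)
```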

[CV-62] SuperADD: Training-free Class-agnostic Anomaly Segmentation – CVPR 2026 VAND 4.0 Workshop Challenge Industrial Track CVPR2026

链接: https://arxiv.org/abs/2605.14808
作者: Lukas Roming,Felix Lehnerer,Jonas V. Funk,Andreas Michel,Georg Maier,Thomas Längle,Jürgen Beyerer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report for the CVPR 2026 VAND 4.0 workshop challenge industrial track

点击查看摘要

Abstract:Visual anomaly detection (AD) for industrial inspection is a highly relevant task in modern production environments. The problem becomes particularly challenging when training and deployment data differ due to changes in acquisition conditions during production. In the VAND 4.0 Industrial Track, models must remain robust under distribution shifts such as varying illumination, and their performance is assessed on the MVTec AD 2 dataset. To address this setting, we propose a training-free and class-agnostic anomaly detection pipeline based on the work of SuperAD. Our approach improves generalization through several modifications designed to enhance robustness under distribution shifts. These adaptations include using a DINOv3 backbone, overlapping patch-wise processing, intensity-based augmentations, improved memory-bank subsampling for better coverage of the data distribution, and iterative morphological closing for cleaner and more spatially consistent anomaly maps. Unlike methods that rely on class-specific architectures or per-class hyperparameter tuning, our method uses a single architecture and one shared hyperparameter configuration across all object classes. This makes the approach well suited for industrial deployment, where product variants and appearance changes must be handled with minimal adaptation effort. We achieve segmentation F1 scores of 62.61%, 57.42%, and 54.35% on the public, private, and private mixed test sets of MVTec AD 2, respectively, thereby outperforming SuperAD and other state-of-the-art methods. Code is available at this https URL.
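The memory-bank core of this family of methods is compact: each patch is scored by its distance to the nearest stored "normal" feature. The sketch below shows only that scoring step; the DINOv3 features, subsampling scheme, and morphological post-processing from the report are omitted:

```python
import numpy as np

def anomaly_scores(patch_feats: np.ndarray, memory_bank: np.ndarray) -> np.ndarray:
    """Training-free anomaly scoring: each patch feature is scored by its
    Euclidean distance to the nearest neighbor in a memory bank of
    features collected from normal samples. Far-from-bank patches score high.
    """
    # Pairwise squared distances, shape (num_patches, bank_size).
    d2 = ((patch_feats[:, None, :] - memory_bank[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))

rng = np.random.default_rng(0)
bank = rng.normal(size=(64, 8))                     # "normal" reference features
normal = bank[:4] + 0.01 * rng.normal(size=(4, 8))  # near the bank: low score
outlier = np.full((1, 8), 10.0)                     # far from the bank: high score
scores = anomaly_scores(np.vstack([normal, outlier]), bank)
```

Because the bank is fixed at inference, this scoring rule needs no training, which is what makes the overall pipeline class-agnostic and training-free.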

[CV-63] Can Visual Mamba Improve AI-Generated Image Detection? An In-Depth Investigation

链接: https://arxiv.org/abs/2605.14799
作者: Mamadou Keita,Wassim Hamidouche,Hessen Bougueffa Eutamene,Abdelmalik Taleb-Ahmed,Xianxun Zhu,Abdenour Hadid
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
备注:

点击查看摘要

Abstract:In recent years, computer vision has witnessed remarkable progress, fueled by the development of innovative architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has undeniably contributed to creating increasingly realistic and diverse visual content. However, such advancements in image generation also raise concerns about potential misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged in this rapidly evolving field as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across diverse datasets and synthetic image sources, focusing on key metrics such as accuracy, efficiency, and generalizability across diverse image types and generative models. Through this comprehensive analysis, we aim to elucidate Vision Mamba’s strengths and limitations relative to established methodologies in terms of applicability, accuracy, and efficiency in detecting AI-generated images. Overall, our findings highlight both the promise and current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content. This research is crucial for enhancing detection in an age where distinguishing between real and AI-generated content is a major challenge.

[CV-64] COAL: Counterfactual and Observation-Enhanced Alignment Learning for Discriminative Referring Multi-Object Tracking

链接: https://arxiv.org/abs/2605.14795
作者: Shukun Jia,Shiyu Hu,Yipei Wang,Ximeng Cheng,Yichao Cao,Xiaobo Lu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Referring Multi-Object Tracking (RMOT) faces a fundamental structural contradiction between the high-discriminability demand and the sparse semantic supervision. This mismatch is particularly acute in highly homogeneous scenarios that require fine-grained discrimination over complex compositional semantics. However, under sparse supervision, models overfit to salient yet insufficient cues, thereby encouraging shortcut learning and semantic collapse. To resolve this, we propose COAL (Counterfactual and Observation-enhanced Alignment Learning), a framework that advances RMOT beyond isolated structural optimization through knowledge regularization. First, we introduce Explicit Semantic Injection (ESI) via a VLM to densify the observation space and enhance instance discriminability. Second, leveraging LLM reasoning, we propose Counterfactual Learning (CFL) to augment supervision, enforcing strict attribute verification for robust compositional recognition. These strategies are unified within a Hierarchical Multi-Stream Integration (HMSI) architecture, which distills external knowledge into domain-specific discriminative representations. Experiments on Refer-KITTI and Refer-KITTI-V2 benchmarks validate COAL’s efficacy. Notably, it surpasses the state-of-the-art by 7.28% HOTA on the highly challenging Refer-KITTI-V2. These results demonstrate the effectiveness of knowledge regularization for resolving the sparsity-discriminability paradox in RMOT.

[CV-65] Understanding Imbalanced Forgetting in Rehearsal-Based Class-Incremental Learning

链接: https://arxiv.org/abs/2605.14785
作者: Alberto Tamajo,Srinandan Dasmahapatra,Rahman Attar
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 37 pages; 24 tables; 7 figures; submitted to a journal

点击查看摘要

Abstract:Neural networks suffer from catastrophic forgetting in class-incremental learning (CIL) settings. Rehearsal, replaying a subset of past samples, is a well-established mitigation strategy. However, recent results suggest that, despite balanced rehearsal allocation, some classes are forgotten substantially more than others. Despite its relevance, this imbalanced forgetting phenomenon remains underexplored. This work shows that imbalanced forgetting arises systematically and severely in rehearsal-based CIL and investigates it extensively. Specifically, we construct, from a principled analysis, three last-layer coefficients that capture different gradient-level sources of interference affecting each past class during an incremental step. We then demonstrate that, together, they reliably predict how past classes will rank in terms of forgetting at the end of that step. While predictive performance alone does not establish causality, these results support the interpretation of the coefficients as a plausible mechanistic account linking last-layer gradient-level interactions during training to class-level forgetting outcomes. Notably, one coefficient, capturing self-induced interference, emerges as the strongest predictor, with controlled experiments providing evidence consistent with this coefficient being influenced by the new-class interference coefficient. Overall, our findings provide valuable insights and suggest promising directions for mitigating imbalanced forgetting by reducing class-wise disparities in the identified sources of interference.

[CV-66] MonoPRIO: Adaptive Prior Conditioning for Unified Monocular 3D Object Detection

链接: https://arxiv.org/abs/2605.14781
作者: Leon Davies,Qinggang Meng,Mohamad Saada,Baihua Li,Simon Sølvsten
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 4 figures, 8 tables. Submitted to Pattern Recognition. Code and reproducibility material available at this https URL

点击查看摘要

Abstract:Monocular 3D object detection remains challenging because metric size and depth are underdetermined by single-view evidence, particularly under occlusion, truncation, and projection-induced scale-depth ambiguity. Although recent methods improve depth and geometric reasoning, metric size remains unstable in unified multi-class settings, where class variability and partial visibility broaden plausible size modes. We propose MonoPRIO, a unified monocular 3D detector that targets this bottleneck through adaptive prior conditioning in the size pathway. MonoPRIO constructs class-aware size prototypes offline, routes each decoder query to a soft mixture prior, applies uncertainty-aware log-space conditioning, and uses Cluster-Aligned Prior (CAP) regularisation on matched positives during training. On the official KITTI test server, MonoPRIO achieves the strongest fully reported unified multi-class result among methods reporting complete Car, Pedestrian, and Cyclist metrics. In the car-only setting, it also achieves the strongest 3D bounding-box AP across Easy/Moderate/Hard categories among compared methods without extra data, while using substantially less compute than MonoCLUE. Ablations and diagnostics show complementary gains from routed injection and CAP, with the largest benefits in ambiguity-prone, partially occluded, and low-data regimes. These findings indicate that adaptive priors are most effective when image evidence underdetermines metric size, while atypical geometry or extreme visibility loss can still cause mismatch between routed priors and true instance geometry. Code, trained models, result logs, and reproducibility material are available at this https URL.
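The routed size prior can be sketched as a softmax mixture over class size prototypes taken in log-space, so the decoded size stays positive after exponentiation. The plain routing logits and prototype values below are illustrative, and the paper's uncertainty-aware weighting is omitted:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def routed_size_prior(routing_logits: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Soft-mixture size prior in log-space, in the spirit of MonoPRIO.

    `prototypes` holds per-class (h, w, l) size modes in meters; a decoder
    query is routed to a convex mixture of their logarithms, so the
    decoded size is always positive.
    """
    weights = softmax(routing_logits)          # (num_classes,)
    log_prior = weights @ np.log(prototypes)   # convex mixture in log-space
    return np.exp(log_prior)                   # back to metric (h, w, l)

protos = np.array([[1.5, 1.6, 3.9],    # car
                   [1.8, 0.6, 0.8],    # pedestrian
                   [1.7, 0.6, 1.8]])   # cyclist (illustrative values)
# A query routed confidently to the first class recovers roughly its prototype.
size = routed_size_prior(np.array([5.0, -5.0, -5.0]), protos)
```

Mixing in log-space rather than directly in meters is what lets a soft routing stay on the positive size manifold even when the query is ambiguous between classes.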

[CV-67] BioHuman: Learning Biomechanical Human Representations from Video

链接: https://arxiv.org/abs/2605.14772
作者: Yujun Huo,He Zhang,Chentao Song,Honglin Song,Zongyu Zuo,Tao Yu
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Understanding human motion beyond surface kinematics is crucial for motion analysis, rehabilitation, and injury risk assessment. However, progress in this domain is limited by the lack of large-scale datasets with biomechanical annotations, and by existing approaches that cannot directly infer internal biomechanical states from visual observations. In this paper, we introduce a simulation-based framework for estimating muscle activations from existing motion capture datasets, resulting in BioHuman10M, a large-scale dataset with synchronized video, motion, and activations. Building on BioHuman10M, we propose BioHuman, an end-to-end model that takes monocular video as input and jointly predicts human motion and muscle activations, effectively bridging visual observations and internal biomechanical states. Extensive experiments demonstrate that BioHuman enables accurate reconstruction of both kinematic motion and muscle activity, and generalizes across diverse subjects and motions. We believe our approach establishes a new benchmark for video-based biomechanical understanding and opens up new possibilities for physically grounded human modeling.

[CV-68] EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding ICML2026

链接: https://arxiv.org/abs/2605.14742
作者: Yuejiao Su,Xinshen Zhang,Zhen Ye,Lei Yao,Lap-Pui Chau,Yi Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at ICML 2026. Project page: this https URL

点击查看摘要

Abstract:Understanding human–environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To this end, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework including coarse-grained interpretation and fine-grained response. The first stage holistically interprets egocentric interactions and generates a structured textual description. The second stage produces the textual answer and pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior, which is integrated via a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while OOD grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.

[CV-69] Video-Zero: Self-Evolution Video Understanding

链接: https://arxiv.org/abs/2605.14733
作者: Ruixu Zhang,Deyi Ji,Lanyun Zhu,Xuanyi Liu,Yuxin Meng,Ruihang Chu,Yujiu Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence. In this work, we argue that the key bottleneck of video self-evolution is not difficulty alone, but grounding. We propose Video-Zero, an annotation-free Questioner–Solver co-evolution framework that centers self-evolution on temporally localized evidence. The Questioner discovers informative evidence segments and generates evidence-grounded questions, while the Solver learns to answer and align its predictions with the supporting evidence. This closes an iterative loop of evidence discovery, grounded supervision, and evidence-aligned learning. Across 13 benchmarks spanning temporal grounding, long-video understanding, and video reasoning, Video-Zero consistently improves multiple video VLM backbones, demonstrating the effectiveness and transferability of evidence-centered self-evolution.

[CV-70] UMo: Unified Sparse Motion Modeling for Real-Time Co-Speech Avatars

链接: https://arxiv.org/abs/2605.14731
作者: Xiaoyu Zhan,Xinyu Fu,Chenghao Yang,Xiaohong Zhang,Dongjie Fu,Pengcheng Fang,Tengjiao Sun,Xiaohao Cai,Hansung Kim,Yuanqi Li,Jie Guo,Yanwen Guo
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Speech-driven gestures and facial animations are fundamental to expressive digital avatars in games, virtual production, and interactive media. However, existing methods are either limited to a single modality for audio motion alignment, failing to fully utilize the potential of massive human motion data, or are constrained by the representation ability and throughput of multimodal models, which makes it difficult to achieve high-quality motion generation or real-time performance. We present UMo, a unified sparse motion modeling architecture for real-time co-speech avatars, which processes text, audio, and motion tokens within a unified formulation. Leveraging a spatially sparse Mixture-of-Experts framework and a temporally sparse, keyframe-centric design, UMo efficiently performs real-time dense reconstruction, enabling temporally coherent and high-fidelity animation generation for both facial expressions and gestures. Furthermore, we implement a multi-stage training strategy with targeted audio augmentation to enhance acoustic diversity and semantic consistency. Consequently, UMo preserves fine-grained speech-motion alignment even under strict latency constraints. Extensive quantitative and qualitative evaluations show that UMo achieves better output quality under low latency and real-time performance constraints, offering a practical solution for high-fidelity real-time co-speech avatars.

[CV-71] CHASM: Cross-frequency Harmonized Axis-Separable Mixing for Spectral Token Operators

链接: https://arxiv.org/abs/2605.14727
作者: Pengcheng Fang,Hongli Chen,Yuxia Chen,Tengjiao Sun,Jiaxin Liu,Xiaohao Cai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spectral token mixers based on Fourier transforms provide an efficient way to model global interactions in visual feature maps. Existing designs often either apply filter-wise spectral responses along fixed channel axes, or learn adaptive frequency-indexed channel mixing without explicitly aligning the channel directions used across frequencies. We propose CHASM, a Cross-frequency Harmonized Axis-Separable Mixer, as a structured middle ground. CHASM separates what should be shared from what should remain frequency-specific: all frequencies share a learned channel eigenbasis, while each frequency retains its own positive spectral gains. The shared basis makes channel directions comparable across the spectrum, whereas the positive gains preserve local spectral adaptivity. CHASM applies this structured operator separably along the height and width axes and is used as a drop-in replacement mixer inside existing backbones. We provide a structural characterization of the shared-basis operator family and evaluate CHASM through controlled same-backbone comparisons. Across accelerated MRI reconstruction, undersampled MRI segmentation, and natural-image reconstruction, CHASM consistently improves over same-backbone spectral-mixer baselines. Ablations show that removing the shared-basis constraint weakens performance, and randomizing coherent sampling geometry substantially reduces the gain, supporting cross-frequency harmonization as a useful inductive bias for spectral token operators.
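The shared-basis operator is easy to state along one spatial axis: FFT, rotate channels into a single learned orthogonal basis shared by every frequency, apply per-frequency positive gains, rotate back, inverse FFT. Shapes and the QR-derived basis below are illustrative stand-ins for the learned quantities:

```python
import numpy as np

def chasm_mix_1d(x: np.ndarray, basis: np.ndarray, log_gains: np.ndarray) -> np.ndarray:
    """One-axis sketch of a CHASM-style structured spectral mixer.

    x:         (length, channels) real feature map along one spatial axis.
    basis:     (channels, channels) orthogonal channel eigenbasis shared
               by ALL frequencies (the cross-frequency harmonization).
    log_gains: (num_freqs, channels) frequency-specific log-gains; the
               exponential keeps every spectral gain positive.
    """
    X = np.fft.rfft(x, axis=0)            # (num_freqs, channels)
    coeffs = X @ basis                    # rotate into the shared basis
    coeffs = coeffs * np.exp(log_gains)   # per-frequency positive gains
    X_mixed = coeffs @ basis.T            # rotate back out
    return np.fft.irfft(X_mixed, n=x.shape[0], axis=0)

length, channels = 16, 4
basis = np.linalg.qr(np.random.default_rng(0).normal(size=(channels, channels)))[0]
log_gains = np.zeros((length // 2 + 1, channels))   # unit gains everywhere
x = np.random.default_rng(1).normal(size=(length, channels))
y = chasm_mix_1d(x, basis, log_gains)
# With unit gains the orthogonal basis cancels and the mixer is an identity map.
```

The separation is visible in the two arguments: `basis` is what all frequencies share, while `log_gains` is what each frequency keeps to itself.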

[CV-72] Towards Label-Free Single-Cell Phenotyping Using Multi-Task Learning ICPR

链接: https://arxiv.org/abs/2605.14717
作者: Saqib Nazir,Ardhendu Behera
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in 28th International Conference on Pattern Recognition (ICPR) 2026

点击查看摘要

Abstract:Label-free single-cell imaging offers a scalable, non-invasive alternative to fluorescence-based cytometry, yet inferring molecular phenotypes directly from bright-field morphology remains challenging. We present a unified Deep Learning (DL) framework that jointly performs White Blood Cell (WBC) classification and continuous protein-expression regression from label-free Differential Phase Contrast (DPC) images. Our model employs a Hybrid architecture that fuses convolutional fine-grained texture features with transformer-based global representations through a learnable cross-branch gating module, enabling robust morpho-molecular inference from DPC images. To support downstream interpretability, we further incorporate a Large Language Model (LLM) that generates concise, biologically grounded summaries of the predicted cell states. Experiments on the Berkeley Single Cell Computational Microscopy (BSCCM) and Blood Cells Image benchmarks demonstrate strong performance, achieving a 91.3% WBC classification accuracy and a 0.72 Pearson correlation for CD16 expression regression on BSCCM. These results underscore the promise of label-free single-cell imaging for cost-effective hematological profiling, enabling simultaneous phenotype identification and quantitative biomarker estimation without fluorescent staining. The source code is available at this https URL.
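The joint objective behind the multi-task setup can be sketched as a weighted sum of a classification term and a regression term. The plain-MSE regression loss and the weight `alpha` are assumptions for illustration, since the abstract reports a Pearson correlation metric rather than the exact training loss:

```python
import numpy as np

def multitask_loss(class_logits: np.ndarray, class_label: int,
                   expr_pred: np.ndarray, expr_target: np.ndarray,
                   alpha: float = 1.0) -> float:
    """Joint objective for the two heads described above: cross-entropy
    for WBC classification plus MSE for protein-expression regression.
    """
    z = class_logits - class_logits.max()          # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    ce = -log_probs[class_label]                   # classification term
    mse = float(np.mean((expr_pred - expr_target) ** 2))  # regression term
    return float(ce + alpha * mse)

# A confidently correct classification with perfect regression is near zero.
loss = multitask_loss(np.array([10.0, 0.0, 0.0]), 0,
                      np.array([0.5]), np.array([0.5]))
```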

[CV-73] AnchorRoute: Human Motion Synthesis with Interval-Routed Sparse Control

链接: https://arxiv.org/abs/2605.14716
作者: Pengcheng Fang,Tengjiao Sun,Dongjie Fu,Xiaoyu Zhan,Yanwen Guo,Hansung Kim,Xiaohao Cai
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Sparse anchors provide a compact interface for human motion authoring: users specify a few root positions, planar trajectory samples, or body-point targets, while the system synthesizes the full-body motion that completes the under-specified intent. We present AnchorRoute, a sparse-anchor motion synthesis framework that uses anchors as a shared scaffold for both generation and refinement. Before generation, AnchorRoute converts sparse anchors into anchor-condition features and injects the resulting condition memory into a frozen Transition Masked Diffusion prior through AnchorKV and dual-context conditioning. This preserves the generation quality of the pretrained text-to-motion prior while learning sparse spatial control. After generation, the same anchors are evaluated as residuals: their timestamps define refinement intervals, and their residuals determine where correction should be concentrated. RouteSolver then refines the motion by projecting soft-token updates onto anchor-defined piecewise-affine interval bases. This couples generation-time anchor conditioning with residual-routed refinement under one anchor scaffold. AnchorRoute supports root-3D, planar-root, and body-point control within the same formulation. In benchmark evaluations, AnchorRoute outperforms prior sparse-control methods under the sparse keyjoint protocol and consistently improves anchor adherence across control families. The results show that the learned anchor-conditioned generator and RouteSolver refinement are complementary: the generator preserves text-motion quality, while RouteSolver provides a controllable path toward stronger anchor adherence.
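In its simplest 1-D form, the interval-routed refinement step reduces to interpolating anchor residuals over time: the correction is exact at anchor frames and fades inside the intervals they define. Plain linear interpolation below stands in for RouteSolver's soft-token projection onto piecewise-affine interval bases:

```python
import numpy as np

def route_residuals(motion: np.ndarray, anchor_t: np.ndarray,
                    residuals: np.ndarray) -> np.ndarray:
    """Sketch of anchor-routed refinement on one motion channel.

    Each anchor residual (target minus generated value at that frame) is
    spread over its surrounding interval with a piecewise-linear basis,
    so the refined motion matches the anchors exactly while untouched
    frames outside all intervals are left as generated.
    """
    frames = np.arange(len(motion), dtype=float)
    correction = np.interp(frames, anchor_t, residuals, left=0.0, right=0.0)
    return motion + correction

motion = np.zeros(10)   # one generated motion channel, for illustration
refined = route_residuals(motion, np.array([2.0, 6.0]), np.array([1.0, -1.0]))
```

This concentrates correction where residuals say it is needed, which is the complementarity the abstract describes: the generator preserves quality, the refinement buys anchor adherence.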

[CV-74] Vision-Core Guided Contrastive Learning for Balanced Multi-modal Prognosis Prediction of Stroke

链接: https://arxiv.org/abs/2605.14710
作者: Liren Chen,Lidong Sun,Mingyan Huang,Junzhe Tang,Yinghui Zhu,Guanjie Wang,Yiqing Xia,Ting Xiao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Corresponding author: Ting Xiao

点击查看摘要

Abstract:Deep learning and multi-modal fusion have demonstrated transformative potential in medical diagnosis by integrating diverse data sources. However, accurate prognosis for ischemic stroke remains challenging due to limitations in existing multi-modal approaches. First, current methods are predominantly confined to dual-modal fusion, lacking a framework that effectively integrates the trifecta of medical images, structured clinical data, and unstructured text. Second, they often fail to establish deep bidirectional interactions between modalities. To address these critical gaps, this paper proposes a novel tri-modal fusion model for ischemic stroke prognosis. Our approach first enriches the data representation by employing a Large Language Model (LLM) to automatically generate semi-structured diagnostic text from brain MRIs. This process not only addresses the scarcity of expert annotations but also serves as a regularized semantic enhancement, improving multimodal fusion robustness. Furthermore, we design a core component termed the Vision-Conditioned Dual Alignment Fusion Module (VDAFM), which strategically uses visual features as a conditional prior to guide fine-grained interaction with the generated text. This module achieves a dynamic and profound fusion through a dual semantic alignment loss, effectively mitigating modal heterogeneity. Extensive experiments on a real-world clinical dataset demonstrate that our model achieves state-of-the-art performance.

[CV-75] Breaking Dual Bottlenecks: Evolving Unified Multimodal Models into Self-Adaptive Interleaved Visual Reasoners ICML2026

链接: https://arxiv.org/abs/2605.14709
作者: Qingyang Liu,Bingjie Gao,Canmiao Fu,Zhipeng Huang,Chen Li,Feng Wang,Shuochen Chang,Shaobo Wang,Yali Wang,Keming Ye,Jiangtong Li,Li Niu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Recent unified models integrate multimodal understanding and generation within a single framework. However, an “understanding-generation gap” persists, where models can capture user intent but often fail to translate this semantic knowledge into precise pixel-level manipulation. This gap results in two bottlenecks in anything-to-image task (X2I): the attention entanglement bottleneck, where blind planning struggles with complex prompts, and the visual refinement bottleneck, where unstructured feedback fails to correct imperfections efficiently. In this paper, we propose a novel framework that empowers unified models to autonomously switch between generation strategies based on instruction complexity and model capability. To achieve this, we construct a hierarchical data pipeline that constructs execution paths across three adaptive modes: direct generation for simple cases, self-reflection for quality refinement, and multi-step planning for decomposing complex scenarios. Building on this pipeline, we contribute a high-quality dataset with over 50,000 samples and implement a two-stage training strategy comprising SFT and RL. Specifically, we design step-wise reasoning rewards to ensure logical consistency and intra-group complexity penalty to prevent redundant computational overhead. Extensive experiments demonstrate that our method outperforms existing baselines on X2I, achieving superior generation fidelity among simple-to-complex instructions. The code is released at this https URL.

[CV-76] StyleTextGen: Style-Conditioned Multilingual Scene Text Generation CVPR2026

链接: https://arxiv.org/abs/2605.14708
作者: Zeyu Chen,Fangmin Zhao,Yan Shu,Yichao Liu,Liu Yu,Yu Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted to CVPR 2026

点击查看摘要

Abstract:Style-conditioned scene text generation faces unique challenges in extracting precise text styles from complex backgrounds and maintaining fine-grained style consistency across characters, especially for multilingual scripts. We propose StyleTextGen, a novel framework that learns to perceive and replicate visual text styles across different languages and writing systems. Our approach features three key contributions: First, we introduce a dual-branch style encoder dedicated to style modeling, yielding robust multilingual text style representations in complex real-world scenes. Second, we design a text style consistency loss that enhances style coherence and improves overall visual quality. Third, we develop a mask-guided inference strategy that ensures precise style alignment between generated and reference text. To facilitate systematic evaluation, we construct StyleText-CE, a bilingual scene text style benchmark covering both monolingual and cross-lingual settings. Extensive experiments demonstrate that StyleTextGen significantly outperforms existing methods in style consistency and cross-lingual generalization, establishing new state-of-the-art performance in multilingual style-conditioned text generation.

[CV-77] Towards Continuous Sign Language Conversation from Isolated Signs

链接: https://arxiv.org/abs/2605.14705
作者: Youngmin Kim,Kyobin Choo,Jiwoo Park,Minseo Kim,Chanyoung Kim,Junhyeok Kim,Seong Jae Hwang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.

[CV-78] SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization

链接: https://arxiv.org/abs/2605.14704
作者: Posheng Chen,Powen Cheng,Gueter Josmy Faure,Hung-Ting Su,Winston H. Hsu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce SceneFunRI, a benchmark for Reasoning the Invisible. Based on the SceneFun3D dataset, SceneFunRI formulates the task as a 2D spatial reasoning problem via a semi-automatic pipeline and comprises 855 instances. It requires models to infer the locations of invisible functional objects from task instructions and commonsense reasoning. The strongest baseline model (Gemini 3 Flash) only achieves a CAcc@75 of 15.20, an mIoU of 0.74, and a Dist of 28.65. We group our prompting analysis into three categories: Strong Instruction Prompting, Reasoning-based Prompting, and Spatial Process of Elimination (SPoE). These findings indicate that invisible-region reasoning remains an unstable capability in current VLMs, motivating future work on models that more tightly integrate task intent, commonsense priors, spatial grounding, and uncertainty-aware search.
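The reported metrics can be read concretely: CAcc@75 counts a prediction as correct only when its overlap with the ground-truth region reaches an IoU of 0.75. A minimal sketch of both computations, assuming axis-aligned 2D boxes in (x1, y1, x2, y2) form (the benchmark's exact metric definitions may differ):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cacc_at(preds, gts, thresh=0.75):
    """Fraction of predictions whose IoU with ground truth clears `thresh`."""
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

With this reading, a CAcc@75 of 15.20 means roughly one in seven predicted regions overlaps the hidden object tightly enough to count.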

[CV-79] Generating HDR Video from SDR Video

链接: https://arxiv.org/abs/2605.14703
作者: SaiKiran Tedla,Francesco Banterle,Trevor Canham,Karanpreet Raja,David B. Lindell,Kiriakos N. Kutulakos,Jiacheng Li,Feiran Li,Daisuke Iso
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The high dynamic range (HDR) video ecosystem is approaching maturity, but the problem of upconverting legacy standard dynamic range (SDR) videos persists without a convincing solution. We propose a framework for HDR video synthesis from casual SDR footage by leveraging large-scale generative video models. We introduce a Multi-Exposure Video Model (MEVM) that can predict exposure-bracketed linear SDR video sequences from a single nonlinear SDR video input. We further propose a learnable Video Merging Model (VMM) that merges the predicted exposure-bracketed video into a high-quality HDR sequence while preserving detail in both shadows and highlights. Extensive experiments, quantitative and qualitative evaluation, and a user study demonstrate that our approach enables robust HDR conversion for in-the-wild examples from casual consumer videos and even iconic films. Finally, our model can support HDR synthesis pipelines built upon existing SDR generative video models. Output HDR videos can be viewed on our supplementary webpage: this http URL
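Once the MEVM has produced linear exposure brackets, merging them resembles the classical weighted HDR merge. The VMM in the paper is a learned model; the hand-crafted hat-weighted average below is only the textbook baseline it generalizes, shown here on flat per-pixel lists:

```python
def merge_hdr(brackets, exposures):
    """Classical weighted merge of linear exposure brackets into one HDR value
    per pixel: each bracket is divided by its exposure time, then averaged with
    a hat weight that favours well-exposed mid-range intensities."""
    def hat(v):  # trust mid-tones, distrust clipped shadows/highlights
        return max(1e-4, 1.0 - abs(2.0 * v - 1.0))
    hdr = []
    for px in zip(*brackets):  # pixel-wise tuple across all brackets
        num = sum(hat(v) * (v / t) for v, t in zip(px, exposures))
        den = sum(hat(v) for v in px)
        hdr.append(num / den)
    return hdr
```

In the second pixel below, the longer exposure is clipped at 1.0, so the merge leans almost entirely on the shorter, unclipped bracket.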

[CV-80] EponaV2: Driving World Model with Comprehensive Future Reasoning

链接: https://arxiv.org/abs/2605.14696
作者: Jiawei Xu,Zhizhou Zhong,Zhijian Shu,Mingkai Jia,Mingxiao Li,Jia-Wang Bian,Qian Zhang,Kaicheng Zhang,Jin Xie,Jian Yang,Wei Yin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is solely built on next frame image forecasting. Due to the lack of enough supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performances of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3PDMS, +5.5EPDMS) demonstrate the effectiveness of our methods.

[CV-81] Are Candidate Models Really Needed for Active Learning?

链接: https://arxiv.org/abs/2605.14689
作者: Harshini Mridula Mohan,Maanya Manjunath,Vipul Arya,S.H. Shabbeer Basha,Nitin Cheekatla
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in Computer Vision and Image Understanding (CVIU)

点击查看摘要

Abstract:Deep learning has profoundly impacted domains such as computer vision and natural language processing by uncovering complex patterns in vast datasets. However, the reliance on extensive labeled data poses significant challenges, including resource constraints and annotation errors, particularly in training Convolutional Neural Networks (CNNs) and transformers due to their large numbers of parameters. Active learning offers a promising solution to reduce labeling burdens by strategically selecting the most informative samples for annotation. However, current active learning frameworks are time-intensive, as they select samples iteratively with the help of initial candidate models. This study investigates the feasibility of using CNNs and transformers with randomly initialized weights, eliminating the need for initial candidate models while achieving results comparable to active learning frameworks that depend on such candidate models. We evaluate three confidence-based sampling strategies: high confidence (HC), low confidence (LC), and a combination of high confidence in the early stages of training and low confidence at later stages of training (HCLC). Among these, LC mostly demonstrated the best performance in our experiments, showcasing its effectiveness as an active learning strategy without the need for candidate models. Further, extensive experiments verify the robustness of the proposed active learning methods. By challenging traditional frameworks, the proposed work introduces a streamlined approach to active learning, advancing efficiency and flexibility across diverse datasets and domains.
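The three sampling strategies can be sketched directly from per-sample softmax outputs; the 0.5 training-progress switch point for HCLC below is an illustrative assumption, not a value from the paper:

```python
def select_samples(probs, k, strategy="LC", progress=0.0):
    """Pick k unlabeled samples by softmax confidence (max class probability).
    LC: lowest confidence (most uncertain); HC: highest confidence;
    HCLC: HC early in training, LC later (switch point assumed at 0.5)."""
    conf = [(max(p), i) for i, p in enumerate(probs)]
    if strategy == "HCLC":
        strategy = "HC" if progress < 0.5 else "LC"
    conf.sort(reverse=(strategy == "HC"))
    return [i for _, i in conf[:k]]
```

With randomly initialized weights, the same routine applies unchanged; the paper's point is that the `probs` need not come from a trained candidate model.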

[CV-82] MiVE: Multiscale Vision-language features for reference-guided video Editing ICML2026

链接: https://arxiv.org/abs/2605.14664
作者: Tong Wang,Meng Zou,Chengjing Wu,Xiaochao Qu,Luoqi Liu,Xiaolin Hu,Ting Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026

点击查看摘要

Abstract:Reference-guided video editing takes a source video, a text instruction, and a reference image as inputs, requiring the model to faithfully apply the instructed edits while preserving original motion and unedited content. Existing methods fall into two paradigms, each with inherent limitations: decoupled encoders suffer from modality gaps when processing instructions and visual content independently, while unified vision-language encoders lose fine-grained spatial details by relying solely on final-layer representations. We observe that VLM layers encode complementary information hierarchically – early layers capture localized spatial details essential for precise editing, while deeper layers encode global semantics for instruction comprehension. Building on this insight, we present MiVE (Multiscale Vision-language features for reference-guided video Editing), a framework that repurposes VLMs as multiscale feature extractors. MiVE extracts hierarchical features from Qwen3-VL and integrates them into a unified self-attention Diffusion Transformer, eliminating the modality mismatch inherent in cross-attention designs. Experiments demonstrate that MiVE achieves state-of-the-art performance by ranking highest in human preference, outperforming both academic methods and commercial systems.
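The multiscale idea above amounts to tapping a VLM's hidden states at several depths rather than only the final layer. A toy sketch, assuming per-layer token features are available as plain lists; the layer indices 2/12/23 are illustrative placeholders, not the depths MiVE actually taps in Qwen3-VL:

```python
def multiscale_features(layer_outputs, picks=(2, 12, 23)):
    """Select early (local spatial detail), middle, and final (global semantic)
    hidden states from a VLM's layer stack and concatenate them per token,
    producing one multiscale feature vector per token."""
    feats = [layer_outputs[i] for i in picks]
    return [sum((f[t] for f in feats), []) for t in range(len(feats[0]))]
```

The concatenated tokens would then feed the unified self-attention DiT instead of final-layer features alone.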

[CV-83] Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging ICML2026

链接: https://arxiv.org/abs/2605.14654
作者: Tan Pan,Shuhao Mei,Yixuan Sun,Kaiyu Guo,Chen Jiang,Zhaorui Tan,Mengzhu Li,Limei Han,Xiang Zou,Yuan Cheng,Mahsa Baktashmotlagh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML2026

点击查看摘要

Abstract:Self-supervised pre-training methods in medical imaging typically treat each individual as an isolated instance, learning representations through augmentation-based objectives or masked reconstruction. They often do not adequately capitalize on a key characteristic of physiological features: anatomical structures maintain consistent spatial relationships across individuals (instances), such as the thalamus being medial to the basal ganglia, regardless of variations in brain size, shape, or pathology. We propose leveraging this cross-instance topological consistency as a supervisory signal. The challenge arises from the inherent variability in medical imaging, which can differ significantly across instances and modalities. To tackle this, we focus on two alignment regimes. (i) Intra-instance: with pixel-level correspondences available, a cross-modal triplet objective explicitly preserves local neighborhood topology. (ii) Inter-instance: without such supervision, we derive pseudo-correspondences to control partial neighborhood alignment and prevent topology collapse across modalities. We validate our approach across 7 downstream multi-modal tasks, achieving average improvements of 1.1% and 5.94% in segmentation and classification tasks, respectively, and demonstrating significantly better robustness when modalities are missing at test time.
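The intra-instance objective described above is, at its core, a triplet margin loss applied across modalities: the positive is the corresponding anatomical location in the other modality, and the margin keeps local neighborhood ordering intact. A minimal sketch (the paper's exact margin and mining scheme are not specified here):

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the cross-modal positive closer to the anchor than the negative
    by at least `margin`; zero loss once the ordering is satisfied."""
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

The loss vanishes exactly when the correct cross-modal correspondence is already the nearest neighbor by a comfortable margin.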

[CV-84] ERRA-CD: Multi-Temporal Framework for Multi-class and Semantic Change Detection

链接: https://arxiv.org/abs/2605.14651
作者: Omkar Oak,Rukmini Nazre,Rujuta Budke,Suraj Sawant
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper presented at 11th International Congress on Information and Communication Technology (ICICT) 2026, London

点击查看摘要

Abstract:Urban vegetation monitoring plays a vital role in understanding environmental changes, yet comprehensive datasets for this purpose remain limited. To address this gap, we present the Temporal Remote-sensing Repository for Analyzing Change Detection (TERRA-CD), a benchmark dataset comprising 5,221 Sentinel-2 image pairs from 2019 and 2024, covering 232 cities across the USA and Europe. The dataset features three distinct annotation schemes: 4-class land cover mapping masks, 3-class vegetation change masks, and 13-class semantic change masks capturing all possible land cover transitions. Using various deep learning approaches including Siamese networks, STANet variants, Bi-SRNet, Changemask, Post-Classification Comparison, and HRSCD strategies, we evaluated the dataset’s effectiveness for both vegetation Multi-class Change Detection as well as Semantic Change Detection. The proposed dataset and methods are available at this https URL.

[CV-85] Vision-Based Water Level and Flow Estimation

链接: https://arxiv.org/abs/2605.14645
作者: ZhiXin Sun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the rapid evolution of computer vision, vision-based methodologies for water level and river surface velocity estimation have reached significant maturity. Compared to traditional sensing, these techniques offer superior interpretability, automated data archiving, and enhanced system robustness. However, challenges such as environmental sensitivity, limited precision, and complex site calibration persist. This work proposes an integrated framework that synergizes state-of-the-art (SOTA) vision models with statistical modeling. By leveraging physical priors and robust filtering strategies, we improve the accuracy of water level detection and flow estimation. Code will be available at this https URL

[CV-86] How to Evaluate and Refine your CAM ICPR2026

链接: https://arxiv.org/abs/2605.14641
作者: Luca Domeniconi,Alessandra Stramiglio,Michele Lombardi,Samuele Salti
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ICPR 2026

点击查看摘要

Abstract:Class attribution maps (CAMs) provide local explanations for the decisions of convolutional neural networks. While widely used in practice, the evaluation of CAMs remains challenging due to the lack of ground-truth explanations, making it difficult to evaluate the soundness of existing metrics. Independently, most commonly used CAM methods produce low-resolution attribution maps, which limits their usefulness for detailed interpretability. To address the evaluation challenge, we introduce a synthetic dataset with ground-truth attributions that enables a rigorous comparison of CAM evaluation metrics. Using this dataset, we analyze existing metrics and propose ARCC, a new composite metric that more reliably identifies faithful explanations. To address the low resolution issue, we introduce RefineCAM, a method that produces high-resolution attribution maps by aggregating CAMs across multiple network layers. Our results show that RefineCAM consistently outperforms existing methods according to the proposed evaluation.

[CV-87] MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models

链接: https://arxiv.org/abs/2605.14635
作者: Tianwei Chen,Takuya Furusawa,Yuki Hirakawa,Ryotaro Shimizu,Mo Fan,Takashi Wada
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper introduces a multi-label visual emotion analysis benchmark dataset for comprehensively evaluating the ability of multimodal large language models (MLLMs) to predict the emotions evoked by images. Recent user studies report an unintuitive finding: humans may prefer the predictions of MLLMs over the labels in existing datasets. We argue that this phenomenon stems from the suboptimal annotation scheme used in existing datasets, where each annotator is shown a single candidate emotion for each image and judges whether it is evoked or not. This approach is clearly limited because a single image can evoke multiple emotions with varying intensities. As a result, evaluations based on these datasets may underestimate the capabilities of MLLMs, yet an appropriate benchmark for evaluating such models remains lacking. To address this issue, we introduce a new multi-label benchmark dataset for visual emotion analysis toward MLLMs evaluation. We hire 20 annotators per image and ask them to select all emotions they feel from an image. Then, we aggregate the votes across all annotators, providing a more reliable and representative dataset labeled with a distribution of emotions. The resulting dataset contains 10,344 images with 236,998 valid votes across eight emotions. Based on this benchmark dataset, we evaluate several recent models, including Qwen3-VL, OpenAI’s GPT, Gemini, and Claude. We assess model performance on both dominant emotion prediction and emotion distribution prediction. Our results demonstrate the progress achieved by recent MLLMs while also indicating that substantial room for improvement remains. Furthermore, our experiments with LLM-as-a-judge show that the method does not consistently improve MLLMs’ performance, indicating its limitations for the subjective task of visual emotion analysis.
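The annotation scheme above reduces to a simple vote aggregation: each of the 20 annotators contributes a set of selected emotions, and the pooled votes yield both a distribution and a dominant label, matching the two evaluation settings. A minimal sketch, normalizing by the total number of selections (the benchmark's exact normalization is an assumption):

```python
def aggregate_votes(votes, emotions):
    """Pool per-annotator multi-label selections into a normalized emotion
    distribution plus the dominant (most-voted) emotion."""
    counts = {e: 0 for e in emotions}
    total = 0
    for selection in votes:          # one annotator's set of chosen emotions
        for e in selection:
            counts[e] += 1
            total += 1
    dist = {e: c / total for e, c in counts.items()} if total else dict(counts)
    dominant = max(counts, key=counts.get)
    return dist, dominant
```

Because annotators may pick multiple emotions, the distribution naturally encodes varying intensities instead of a single forced label.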

[CV-88] Action-Inspired Generative Models

链接: https://arxiv.org/abs/2605.14631
作者: Eshwar R. A.,Debnath Pal
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 5 figures, and 4 tables

点击查看摘要

Abstract:We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies along a structurally coherent trajectory or a degenerate one. We address this by introducing a lightweight learned scalar potential V_\phi that scores bridge samples online and modulates the drift objective via importance weights derived through a stop-gradient barrier – preventing adversarial feedback between the two networks whilst preserving V_\phi 's guiding signal. Crucially, V_\phi comprises only \sim 1.4% of the primary drift network’s parameter count, adds no overhead to the inference graph, and requires no iterative half-bridge fitting or auxiliary stochastic differential equation (SDE) solvers: it is a plug-and-play enhancement to any bridge-matching training loop. At inference, V_\phi is discarded entirely, leaving standard Euler-Maruyama integration of the exponential moving average (EMA) drift. We demonstrate that selectively penalising uninformative transport paths through the learned potential yields consistent improvements in generation quality across fidelity and coverage metrics.
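The role of the learned potential can be illustrated with a toy weighted regression loss: each bridge sample's squared drift residual is scaled by an importance weight derived from its potential score, and the weights are treated as constants, mimicking the stop-gradient barrier. The exponential weighting form below is an assumption for illustration, not the paper's formula:

```python
import math

def weighted_drift_loss(residuals, potentials):
    """Bridge-matching regression loss where each sample's squared drift
    residual is scaled by an importance weight from a potential score V.
    Low V marks a structurally coherent transport path, so it gets more
    weight; the weights enter as plain constants (the stop-gradient)."""
    w = [math.exp(-v) for v in potentials]      # low potential => trusted path
    z = sum(w) / len(w)
    w = [wi / z for wi in w]                    # self-normalized weights
    return sum(wi * r ** 2 for wi, r in zip(w, residuals)) / len(residuals)
```

With uniform potentials the loss collapses to ordinary mean squared error, which is exactly the baseline behavior AGMs depart from.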

[CV-89] UniTriGen: Unified Triplet Generation of Aligned Visible-Infrared-Label for Few-Shot RGB-T Semantic Segmentation

链接: https://arxiv.org/abs/2605.14626
作者: Ping Zhou,Haoyu Wang,Mengmeng Zheng,Lei Zhang,Wei Wei,Chen Ding,Fei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RGB-T semantic segmentation requires strictly aligned VIS-IR-Label triplets; however, such aligned triplet data are often scarce in real-world scenarios. Existing generative augmentation methods usually adopt cascaded generation paradigms, decomposing joint triplet generation into local conditional processes. As a result, consistency among VIS, IR, and Label in spatial structure, semantic content, and cross-modal details cannot be reliably maintained. To address this issue, we propose UniTriGen, a unified triplet generation framework that directly generates spatially aligned, semantically consistent, and modality complementary VIS-IR-Label triplets under the guidance of text prompts. UniTriGen first introduces a unified triplet generation mechanism, where VIS, IR, and Label are jointly encoded into a shared latent space and modeled with a diffusion process to enforce global cross-modal consistency. Lightweight modality-specific residual adapters are further integrated into this mechanism to accommodate modality-specific imaging characteristics and output formats. To mitigate generation bias caused by imbalanced scene and class distributions in limited paired triplets, UniTriGen also employs a scene-balanced and class-aware few-shot sampling strategy, which induces a more balanced sampling distribution and enhances the scene and class diversity of generated triplets. Experiments show that UniTriGen generates high-quality aligned triplets from limited real paired data, thereby achieving consistent performance improvements across various RGB-T semantic segmentation models.

[CV-90] CalibAnyView: Beyond Single-View Camera Calibration in the Wild

链接: https://arxiv.org/abs/2605.14615
作者: Boying Li,Cheng Zhang,Weirong Chen,Daniel Cremers,Ian Reid,Hamid Rezatofighi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 44 pages, 25 figures

点击查看摘要

Abstract:Camera calibration is a fundamental prerequisite for reliable geometric perception, yet classical approaches rely on controlled acquisition setups that are impractical for in-the-wild imagery. Recent learning-based methods have shown promising results for single-view calibration, but inherently neglect geometric consistency across multiple views. We introduce CalibAnyView, a unified formulation that supports an arbitrary number of input views ( N \geq 1 ) by explicitly modeling cross-view geometric consistency. To facilitate this, we construct a large-scale multi-view video dataset covering diverse real-world scenarios, including multiple camera models, dynamic scenes, realistic motion trajectories, and heterogeneous lens distortions. Building on this dataset, we develop a multi-view transformer that predicts dense perspective fields, which are further integrated into a geometric optimization framework to jointly estimate camera intrinsics and gravity direction. Extensive experiments demonstrate that CalibAnyView consistently outperforms state-of-the-art methods, achieves strong robustness under single-view settings, and further improves with multi-view inference, providing a reliable foundation for downstream tasks such as 3D reconstruction and robotic perception in the wild.

[CV-91] Deep Image Segmentation via Discriminant Feature Learning ICIP2026

链接: https://arxiv.org/abs/2605.14609
作者: Adam Dawid Sztamborski,Raül Pérez-Gonzalo,Antonio Agudo
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to ICIP 2026

点击查看摘要

Abstract:Accurate image segmentation remains challenging, particularly in generating sharp, confident boundaries. While modern architectures have advanced the field, many of them still rely on standard loss functions like Cross-Entropy and Dice, which often neglect the discriminative structure of learned features, leading to inaccurate boundaries. This work introduces Deep Discriminant Analysis (DDA), a differentiable, architecture-agnostic loss function that embeds classical discriminant principles for network training. DDA explicitly maximizes between-class variance while minimizing within-class one, promoting compact and separable feature distributions without increasing inference cost. Evaluations on the DIS5K benchmark demonstrate that DDA consistently improves segmentation accuracy, boundary sharpness, and model confidence across various architectures. Our results show that integrating discriminant analysis offers a simple, effective path for building more robust segmentation models.
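DDA's stated objective follows the classical Fisher discriminant ratio: shrink within-class scatter while growing between-class scatter. A minimal sketch of that ratio on toy per-pixel features, so that minimizing the returned value maximizes class separability (the paper's exact differentiable formulation may differ):

```python
def dda_loss(features, labels, eps=1e-8):
    """Fisher-style discriminant objective: within-class scatter divided by
    between-class scatter, computed over feature vectors with class labels."""
    classes = set(labels)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / len(features) for d in range(dim)]
    within = between = 0.0
    for c in classes:
        fc = [f for f, y in zip(features, labels) if y == c]
        mc = [sum(f[d] for f in fc) / len(fc) for d in range(dim)]
        within += sum(sum((f[d] - mc[d]) ** 2 for d in range(dim)) for f in fc)
        between += len(fc) * sum((mc[d] - mean[d]) ** 2 for d in range(dim))
    return within / (between + eps)
```

Well-separated feature clusters drive the ratio toward zero, which is the compact-and-separable regime the loss rewards.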

[CV-92] ViMU: Benchmarking Video Metaphorical Understanding

链接: https://arxiv.org/abs/2605.14607
作者: Qi Li,Xinchao Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Any new medium, once it emerges, is used for more than the transmission of overt content alone. The information it carries typically operates on two levels: one is the content directly presented, while the other is the subtext beneath it-the implicit ideas and intentions the creator seeks to convey through the medium. Likewise, since video technologies became widely adopted, video has served not only as a powerful tool for recording and communicating visual information, but also as a vehicle for emotions, attitudes, and social meanings that are often difficult to articulate explicitly. Thus, the true meaning of many videos does not reside solely in what is shown on screen; it is often embedded in context, style of expression, and the viewer’s social experience. Some forms of such video subtext are humorous, while others carry irony, mockery, or criticism. These implicit meanings can also be interpreted very differently across cultural backgrounds and social groups. However, most existing video understanding models still focus primarily on literal visual comprehension, such as recognizing objects, actions, or temporal relations, and lack a systematic ability to understand the metaphorical, ironic, and social meanings embedded in videos. To bridge this gap, we introduce ViMU, the first benchmark designed to systematically evaluate the subtext understanding capabilities of frontier models in videos. ViMU assesses whether video understanding models can go beyond literal perception to infer implicit meaning while grounding their interpretations in multimodal evidence and answering both open-ended and multiple-choice questions. Importantly, all questions are designed to be hint-free, ensuring that no key evidence is disclosed to models before answering.

[CV-93] MambaRain: Multi-Scale Mamba-Attention Framework for 0-3 Hour Precipitation Nowcasting

链接: https://arxiv.org/abs/2605.14606
作者: Chunlei Shi,Cui Wu,Xiang Xu,Hao Li,Ni Fan,Xue Han,Yongchao Feng,Yufeng Zhu,Boyu Liu,Zengliang Zang,Hongbin Wang,Yanlan Yang,Dan Niu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages,7 figures

点击查看摘要

Abstract:Accurate precipitation nowcasting over extended horizons (0-3 hours) is essential for disaster mitigation and operational decision-making, yet remains a critical challenge in the field. Existing deterministic approaches are predominantly constrained to shorter prediction windows (0-2 hours), exhibiting severe performance degradation beyond 90 minutes owing to their inherent difficulty in capturing long-range spatiotemporal dependencies from radar-derived observations. To address these fundamental limitations, we propose MambaRain, a novel multi-scale encoder-decoder architecture that synergistically integrates Mamba’s linear-complexity long-range temporal modeling with self-attention mechanisms for explicit spatial correlation capture. The core innovation lies in a hybrid design paradigm wherein Mamba blocks leverage selective state space mechanisms to model global temporal dynamics across extended sequences with computational efficiency, while self-attention modules explicitly characterize spatial correlations within precipitation fields - a capability inherently absent in Mamba’s sequential processing paradigm. This complementary synergy enables comprehensive spatiotemporal representation learning, effectively extending the viable forecasting horizon to 2-3 hours with substantial accuracy improvements. Furthermore, we introduce a spectral loss formulation to mitigate blurring artifacts characteristic of chaotic precipitation systems, thereby preserving fine-scale motion details critical for nowcasting accuracy. Experimental validation demonstrates that MambaRain substantially outperforms existing deterministic methodologies in 0-3 hour nowcasting tasks, with particularly pronounced performance gains in the challenging 2-3 hour prediction range.
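The spectral loss idea can be sketched in one dimension: compare amplitude spectra instead of raw values, so a blurred prediction that matches the mean but loses high-frequency structure is still penalized. A naive-DFT illustration (the paper presumably operates on 2D radar fields, and its exact formulation is an assumption):

```python
import cmath
import math

def spectrum(x):
    """Magnitudes of the naive DFT of a 1D signal."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n)]

def spectral_loss(pred, target):
    """Mean L1 gap between amplitude spectra; penalizes the loss of
    high-frequency detail that makes MSE-trained nowcasts look blurred."""
    sp, st = spectrum(pred), spectrum(target)
    return sum(abs(a - b) for a, b in zip(sp, st)) / len(pred)
```

A constant prediction matching the target's mean incurs zero pixel-wise bias yet a clearly nonzero spectral loss, which is the failure mode this term targets.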

[CV-94] Towards Accurate Single Panoramic 3D Detection: A Semantic Gaussian Centric Approach ICME2026

链接: https://arxiv.org/abs/2605.14601
作者: Kanglin Ning,Yiran Zhao,Wenrui Li,Shaoru Sun,Xingtao Wang,Xiaopeng Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been accepted by ICME 2026

点击查看摘要

Abstract:Three-dimensional object detection in panoramic imagery is crucial for comprehensive scene understanding, yet accurately mapping 2D features to 3D remains a significant challenge. Prevailing methods often project 2D features onto discrete 3D grids, which break geometric continuity and limit representation efficiency. To overcome this limitation, this paper proposes PanoGSDet, a monocular panoramic 3D detection framework built upon continuous semantic 3D Gaussian representations. The proposed framework comprises a panoramic depth estimation component and a semantic Gaussian component. The panoramic depth estimation component extracts the equirectangular semantic and depth features from the monocular panorama input. The semantic Gaussian component includes a semantic Gaussian lifting module that projects spherical features into 3D semantic Gaussians, a semantic Gaussian optimization module that refines these semantic Gaussians, and a Gaussian guided prediction head that generates 3D bounding boxes from optimized Gaussian representations. Extensive experiments on the Structured3D dataset demonstrate that our method significantly outperforms existing methods.
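The spherical lifting step builds on the standard equirectangular mapping from a pixel to a unit viewing ray, which the semantic Gaussian lifting module can then push into 3D. A minimal sketch under a common convention (y-up, longitude spanning the image width); the paper's learned lifting is more involved:

```python
import math

def erp_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit ray on the sphere:
    longitude sweeps [-pi, pi) across the width, latitude [pi/2, -pi/2]
    down the height; returns an (x, y, z) direction with y pointing up."""
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return (math.cos(lat) * math.sin(lon),
            math.sin(lat),
            math.cos(lat) * math.cos(lon))
```

The image center maps to the forward direction and the top row to straight up, so per-pixel features acquire consistent spherical coordinates before being lifted to Gaussians.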

[CV-95] VMU-Diff: A Coarse-to-fine Multi-source Data Fusion Framework for Precipitation Nowcasting

链接: https://arxiv.org/abs/2605.14597
作者: Chunlei Shi,Hao Li,Yufeng Zhu,Boyu Liu,Yongchao Feng,Zengliang Zang,Hongbin Wang,Yanlan Yang,Dan Niu
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Engineering, Finance, and Science (cs.CE); Multimedia (cs.MM)
备注: 5 pages, 2 figures

点击查看摘要

Abstract:Precipitation nowcasting is a vital spatio-temporal prediction task for meteorological applications but faces challenges due to the chaotic property of precipitation systems. Existing methods predominantly rely on single-source radar data to build either deterministic or probabilistic models for extrapolation. However, the single deterministic model suffers from blurring due to MSE convergence. The single probabilistic model, typically represented by diffusion models, can generate fine details but suffers from spurious artifacts that compromise accuracy and computational inefficiency. To address these challenges, this paper proposes a novel coarse-to-fine Vision Mamba Unet and residual Diffusion (VMU-Diff) based precipitation nowcasting framework. It realizes precipitation nowcasting through a two-stage process, i.e., a deterministic model-based coarse stage to predict global motion trends and a probabilistic model-based fine stage to generate fine prediction details. In the coarse prediction stage, rather than single-source radar data, both radar and multi-band satellite data are taken as input. A spatial-temporal attention block and several Vision mamba state-space blocks realize multi-source data fusion, and predict the future echo global dynamics. The fine-grained stage is realized by a spatio-temporal refine generator based on residual conditional diffusion models. It first obtains spatio-temporal residual features based on coarse prediction and ground truth, and further reconstructs the residual via conditional Mamba state-space module. Experiments on Jiangsu SWAN datasets demonstrate the improvements of our method over state-of-the-art methods, particularly in short-term forecasts.

[CV-96] TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

链接: https://arxiv.org/abs/2605.14594
作者: Bojun Xiong,Zoubin Bi,Xinghui Peng,Yunmu Wang,Junchen Deng,Jun Liang,Jing Li,Bowen Cai,Huan Fu
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Technical Report

点击查看摘要

Abstract:High-fidelity 3D head generation plays a crucial role in the film, animation and video game industries. In industrial pipelines, studios typically enforce a fixed reference topology across all head assets, as such a clean and uniform topology is a prerequisite for production-level rigging, skinning and animation. In this paper, we present TOPOS, a framework tailored for single image conditioned 3D head generation that jointly recovers geometry and appearance under such an industry-standard topology. In contrast to general 3D generative models which produce triangle meshes with inconsistent topology and numerous vertices, hindering semantic correspondence and asset-level reuse, TOPOS generates head meshes with a fixed, studio-style topology, enabling consistent vertex-level correspondence across all generated heads. To model heads under this unified topology, we propose a novel variational autoencoder structure, termed TOPOS-VAE. Inspired by multimodal large language models (MLLMs), our TOPOS-VAE leverages the Perceiver Resampler to convert input point clouds sampled from head meshes of diverse topologies into the target reference topology. Building upon TOPOS-VAE’s structured latent space, we train a rectified flow transformer, TOPOS-DiT, to efficiently generate high-fidelity head meshes from a single image. We further present TOPOS-Texture, an end-to-end module that produces relightable UV texture maps from the same portrait image via fine-tuning a multimodal image generative model. The generated textures are spatially aligned with the underlying mesh geometry and faithfully preserve high-frequency appearance details. Extensive experiments demonstrate that TOPOS achieves state-of-the-art performance on 3D head generation, surpassing both classical face reconstruction methods and general 3D object generative models, highlighting its effectiveness for digital human creation.

[CV-97] FedStain: Modeling Higher-Order Stain Statistics for Federated Domain Generalization in Computational Pathology

链接: https://arxiv.org/abs/2605.14590
作者: Fengyi Zhang,Junya Zhang,Wenzhuo Sun
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Robust whole-slide image (WSI) analysis under strict data-governance remains challenging due to substantial cross-institutional stain heterogeneity. Domain generalization (DG) mitigates these shifts but typically requires centralized data, conflicting with privacy regulations. Federated learning (FedL) provides a decentralized alternative; however, existing FedL and federated DG (FedDG) approaches rely almost exclusively on low-order statistics, assuming Gaussian-like stain distributions. In contrast, real-world staining processes often produce asymmetric, heavy-tailed color distributions due to biochemical diffusion and scanner nonlinearity. Consequently, current methods fail to model the higher-order, non-Gaussian characteristics dominating real-world stain variability. To address this, we propose FedStain, a stain-aware FedDG framework explicitly incorporating higher-order stain moments–skewness and kurtosis–as compact statistical descriptors exchanged during federated optimization. These descriptors require no pixel-level data transmission, preserving strict privacy and communication efficiency, while enabling the global model to capture stain variability missed by low-order statistics. FedStain also employs a contrastive, cross-site parameter aggregation strategy to promote stain-invariant representations without relaxing data constraints. Extensive experiments on Camelyon17 and our new MvMidog-Fed benchmark show FedStain yields consistent improvements, outperforming state-of-the-art FedL, DG, and FedDG baselines by up to +3.9% absolute accuracy. To our knowledge, FedStain is the first FedDG approach to explicitly model higher-order stain statistics, enabling robust cross-institutional deployment in computational pathology.
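
The higher-order stain descriptors at the core of FedStain are standard statistical moments. A minimal sketch of computing skewness and excess kurtosis for one stain channel (function name and population normalization are assumptions; the abstract does not specify the exact estimator):

```python
import math

def stain_moments(values):
    """Mean, variance, skewness, and excess kurtosis of a 1-D sequence of
    stain-channel intensities (population moments). These are the kind of
    compact descriptors FedStain exchanges instead of pixel data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var)
    skew = sum(((v - mean) / std) ** 3 for v in values) / n
    kurt = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0  # excess kurtosis
    return mean, var, skew, kurt

# A symmetric sample has near-zero skewness; a heavy right tail pushes it positive,
# which is exactly the asymmetry low-order statistics miss.
print(stain_moments([1, 2, 3, 4, 5])[2])    # ~0
print(stain_moments([1, 1, 1, 2, 10])[2] > 0)
```

Only these four scalars per site would need to cross the network, which is why such descriptors preserve privacy and communication efficiency.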

[CV-98] Med-DisSeg: Dispersion-Driven Representation Learning for Fine-Grained Medical Image Segmentation

链接: https://arxiv.org/abs/2605.14579
作者: Zhiquan Chen,Haitao Wang,Guowei Zou,Hejun Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate medical image segmentation is fundamental to precision medicine, yet robust delineation remains challenging under heterogeneous appearances, ambiguous boundaries, and large anatomical variability. Similar intensity and texture patterns between targets and surrounding tissues often lead to blurred activations and unreliable separation. We attribute these failures to representation collapse during encoding and insufficient fine-grained multi-scale decoding. To address these issues, we propose Med-DisSeg, a dispersion-driven medical image segmentation framework that jointly improves representation learning and anatomical delineation. Med-DisSeg combines a lightweight Dispersive Loss with adaptive attention for fine-grained structure segmentation. The Dispersive Loss enlarges inter-sample margins by treating in-batch hidden representations as negative pairs, producing well-dispersed and boundary-aware embeddings with negligible overhead. Based on these enhanced representations, the encoder strengthens structure-sensitive responses, while the decoder performs adaptive multi-scale calibration to preserve complementary local texture and global shape information. Extensive experiments on five datasets spanning three imaging modalities demonstrate consistent state-of-the-art performance. Moreover, Med-DisSeg achieves competitive results on multi-organ CT segmentation, supporting its robustness and cross-task applicability.
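
The abstract describes the Dispersive Loss as treating in-batch hidden representations as negative pairs. A minimal sketch, assuming a common log-mean-exp variant over pairwise squared distances (the paper's exact formulation may differ):

```python
import math

def dispersive_loss(reps, tau=1.0):
    """Log-mean-exp of negative squared pairwise distances over a batch of
    hidden representations: every in-batch pair acts as a negative pair,
    so minimizing this loss pushes embeddings apart and counteracts
    representation collapse."""
    n = len(reps)
    terms = []
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(reps[i], reps[j]))
            terms.append(math.exp(-d2 / tau))
    return math.log(sum(terms) / len(terms))

# Collapsed (identical) embeddings incur a higher loss than well-dispersed ones.
collapsed = [[0.0, 0.0]] * 4
spread = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0], [3.0, 3.0]]
print(dispersive_loss(collapsed) > dispersive_loss(spread))  # True
```

Because the loss only touches hidden states already computed in the forward pass, its overhead is negligible, matching the abstract's claim.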

[CV-99] Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction CVPR2026

链接: https://arxiv.org/abs/2605.14569
作者: Yujie Wei,Chenglong Ma,Jianxiong Gao,Chenhui Wang,Shiwei Zhang,Biao Gong,Shuai Tan,Hangjie Yuan,Hongming Shan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Reconstructing dynamic visual experiences as videos from functional magnetic resonance imaging (fMRI) is pivotal for advancing the understanding of neural processes. However, current fMRI-to-video reconstruction methods are hindered by a semantic gap between noisy fMRI signals and the rich content of videos, stemming from a reliance on incomplete semantic embeddings that neither capture video-specific cues (e.g., actions) nor integrate prior knowledge. To this end, we draw inspiration from the dual-pathway processing mechanism in human brain and introduce CineNeuron, a novel hierarchical framework for semantically enhanced video reconstruction from fMRI signals with two synergistic stages. First, a bottom-up semantic enrichment stage maps fMRI signals to a rich embedding space that comprehensively captures textual semantics, image contents, action concepts, and object categories. Second, a top-down memory integration stage utilizes the proposed Mixture-of-Memories method to dynamically select relevant “memories” from previously seen data and fuse them with the fMRI embedding to refine the video reconstruction. Extensive experimental results on two fMRI-to-video benchmarks demonstrate that CineNeuron surpasses state-of-the-art methods across various metrics.

[CV-100] SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

链接: https://arxiv.org/abs/2605.14566
作者: Zhiquan Chen,Haitao Wang,Guowei Zou,Hejun Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.

[CV-101] LiWi: Layering in the Wild

链接: https://arxiv.org/abs/2605.14552
作者: Yu He,Fang Li,Haoyang Tong,Lichen Ma,Xinyuan Shan,Jingling Fu,Dong Chen,Luohang Liu,Junshi Huang,Yan Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent advances in generative models have empowered impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundary. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and degradation-restoration objective provides boundary-correction supervision by recovering clean foreground image from degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.

[CV-102] Local Spatiotemporal Convolutional Network for Robust Gait Recognition

链接: https://arxiv.org/abs/2605.14548
作者: Xiaoyun Wang,Cunrong Li,Wu Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Gait recognition, as a promising biometric technology, identifies individuals through their unique walking patterns and offers distinctive advantages including non-invasiveness, long-range applicability, and resistance to deliberate disguise. Despite these merits, capturing the intrinsic motion patterns concealed within consecutive video frames remains challenging due to the complexity of video data and the interference of external covariates such as viewpoint changes, clothing variations, and carrying conditions. Existing approaches predominantly rely on either static appearance features extracted from individual silhouette frames or employ complex sequential models (e.g., LSTM, 3D convolutions) that demand substantial computational resources and sophisticated training strategies. To address these limitations, we propose a Local Spatiotemporal Convolutional Network (LSTCN), a structurally simple yet highly effective dual-branch architecture that endows standard two-dimensional convolutional networks with the capacity to extract temporal information. Specifically, we introduce a Global Bidirectional Spatial Pooling (GBSP) mechanism that reduces the dimensionality of gait tensors by decomposing spatial features into horizontal and vertical strip-based local representations, enabling the temporal dimension to participate in standard 2D convolution operations. Building upon this, we design a Local Spatiotemporal Convolutional (LSTC) layer that jointly processes temporal and spatial dimensions, allowing the network to adaptively learn strip-based gait motion patterns. We further extend this formulation with asymmetric convolution kernels that independently attend to the temporal, spatial, and joint spatiotemporal domains, thereby enriching the extracted feature representations.
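
The GBSP step can be sketched as row-wise and column-wise average pooling over a spatial feature map; the single-channel simplification and mean pooling below are assumptions for clarity:

```python
def gbsp(feature_map):
    """Global Bidirectional Spatial Pooling sketch: collapse an H x W map
    into a horizontal strip vector (one value per row) and a vertical
    strip vector (one value per column). Dropping one spatial axis this
    way is what lets the temporal axis take its place in a standard 2D
    convolution."""
    h = len(feature_map)
    w = len(feature_map[0])
    horizontal = [sum(row) / w for row in feature_map]            # H values
    vertical = [sum(feature_map[i][j] for i in range(h)) / h      # W values
                for j in range(w)]
    return horizontal, vertical

fm = [[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0]]
print(gbsp(fm))  # ([2.0, 5.0], [2.5, 3.5, 4.5])
```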

[CV-103] PROVE: A Perceptual RemOVal cohErence Benchmark for Visual Media

链接: https://arxiv.org/abs/2605.14534
作者: Fuhao Li,Shaofeng You,Jiagao Hu,Yu Liu,Yuxuan Chen,Zepeng Wang,Fei Wang,Daiguo Zhou,Jian Luan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Project Page: this https URL

点击查看摘要

Abstract:Evaluating object removal in images and videos remains challenging because the task is inherently one-to-many, yet existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for RC metrics and PROVE-Bench are publicly available at: this https URL.
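
The sliding-window feature comparison behind RC-S can be illustrated with a toy scoring rule: match each window feature from the edited region against background window features by cosine similarity. The function names and the max-then-average rule are illustrative assumptions, not the paper's exact metric:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rc_s(masked_windows, background_windows):
    """For each window feature inside the edited region, take its best
    cosine match among background windows and average; high values mean
    the filled region is statistically coherent with the background."""
    scores = [max(cosine(m, b) for b in background_windows)
              for m in masked_windows]
    return sum(scores) / len(scores)

bg = [[1.0, 0.0], [0.8, 0.2]]
coherent = [[0.9, 0.1]]    # fill that resembles the background
incoherent = [[0.0, 1.0]]  # fill with alien statistics (e.g., blur or residue)
print(rc_s(coherent, bg) > rc_s(incoherent, bg))  # True
```

The same comparison applied to shared restored regions across adjacent frames gives the flavor of RC-T's temporal counterpart.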

[CV-104] Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

链接: https://arxiv.org/abs/2605.14530
作者: Sujung Hong,Chanyong Yoon,Seongjae Hwang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.

[CV-105] From Sparse to Dense: Spatio-Temporal Fusion for Multi-View 3D Human Pose Estimation with DenseWarper

链接: https://arxiv.org/abs/2605.14525
作者: Ling Li,Changjie Chen,Yuyan Wang,Jiaqing Lyu,Kenglun Chang,Yiyun Chen,Zhidong Deng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In multi-view 3D human pose estimation, models typically rely on images captured simultaneously from different camera views to predict a pose at a specific moment. While providing accurate spatial information, this traditional approach often overlooks the rich temporal dependencies between adjacent frames. To address this, we propose a novel 3D human pose estimation input method: the sparse interleaved input. This method leverages images captured from different camera views at various time points (e.g., View 1 at time t and View 2 at time t+δ), allowing our model to capture rich spatio-temporal information and effectively boost performance. More importantly, this approach offers two key advantages: First, it can theoretically increase the output pose frame rate by N times with N cameras, thereby breaking through single-view frame rate limitations and enhancing the temporal resolution of the output. Second, using a sparse subset of available frames, our method can reduce data redundancy and simultaneously achieve better performance. We introduce the DenseWarper model, which leverages epipolar geometry for efficient spatio-temporal heatmap exchange. We conducted extensive experiments on the Human3.6M and MPI-INF-3DHP datasets. Results demonstrate that our method, utilizing only sparse interleaved images as input, outperforms traditional dense multi-view input approaches and achieves state-of-the-art performance. The source code for this work is available at: this https URL
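
The claimed N-times frame-rate increase follows from staggering the capture instants of the N views. A toy schedule generator (function name and uniform offsets are illustrative assumptions):

```python
def interleaved_schedule(n_views, n_steps, base_period=1.0):
    """Sparse interleaved input sketch: instead of sampling all N views at
    the same instants, offset each view by base_period / N, so the merged
    stream carries N pose timestamps per base period."""
    delta = base_period / n_views
    schedule = []
    for step in range(n_steps):
        for view in range(n_views):
            schedule.append((step * base_period + view * delta, view))
    return schedule

# Two cameras at 1 Hz each yield an effective 2 Hz pose stream.
print(interleaved_schedule(2, 2))
# [(0.0, 0), (0.5, 1), (1.0, 0), (1.5, 1)]
```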

[CV-106] ArcGate: Adaptive Arctangent Gated Activation

链接: https://arxiv.org/abs/2605.14518
作者: Avik Bhattacharya,Siddhant Dnyanesh Gole,Subhasis Chaudhuri,Alejandro C. Frery,Biplab Banerjee
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Activation functions are central to deep networks, influencing non-linearity, feature learning, convergence, and robustness. This paper proposes the Adaptive Arctangent Gated Activation (ArcGate) function, a flexible formulation that generates a broad spectrum of activation shapes via a three-stage non-linear transformation. Unlike conventional fixed-shape activations such as ReLU, GELU, or SiLU, ArcGate uses seven learnable parameters per layer, allowing the neural network to autonomously optimize its non-linearity to the specific requirements of the feature hierarchy and data distribution. We evaluate ArcGate using ResNet-50 and Vision Transformer (ViT-B/16) architectures on three widely used remote sensing benchmarks: PatternNet, UC Merced Land Use, and the 13-band EuroSAT MSI multispectral dataset. Experimental results show that ArcGate consistently outperforms standard baselines, achieving a peak overall accuracy of 99.67% on PatternNet. Most notably, ArcGate exhibits superior structural resilience in noisy environments, maintaining a 26.65% performance lead over ReLU under moderate Gaussian noise (standard deviation 0.1). Analysis of the learned parameters reveals a depth-dependent functional evolution, where the model increases gating strength in deeper layers to enhance signal propagation. These findings suggest that ArcGate is a robust and adaptive general-purpose activation function for high-resolution earth observation tasks.
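
The abstract does not give ArcGate's exact seven-parameter form, so the following is a purely hypothetical arctangent-gated activation in the same spirit; `arcgate` and its parameter names are invented for illustration only:

```python
import math

def arcgate(x, a=1.0, b=1.0, c=0.0, d=1.0, e=1.0, f=0.0, g=0.5):
    """Hypothetical arctan-gated unit: a smooth gate built from a scaled
    and shifted arctan multiplies a linear branch of the input. With the
    defaults it behaves like a SiLU-style gated activation; changing the
    parameters reshapes the non-linearity, which is the idea behind
    making them learnable per layer."""
    gate = g + a * math.atan(b * x + c) / math.pi  # smooth gate around g
    return (d * x + f) * gate * e

print(arcgate(0.0))                  # 0.0 with the default parameters
print(arcgate(2.0) > arcgate(-2.0))  # True: monotone gate keeps ordering
```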

[CV-107] HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

链接: https://arxiv.org/abs/2605.14513
作者: Xuzhe Zheng,Yuexiao Ma,Jing Xu,Xiawu Zheng,Rongrong Ji,Fei Chao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-p sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-p thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
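
Per-head top-p thresholds control how many keys each attention head keeps. A generic top-p selection sketch (HASTE's calibration procedure for choosing each head's threshold is not shown here; this is only the standard selection rule it feeds):

```python
def top_p_mask(weights, p):
    """Keep the smallest set of attention weights whose cumulative mass
    reaches threshold p. A head with a lower p keeps fewer keys, so
    per-head thresholds directly trade sparsity for fidelity."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += weights[i]
        if mass >= p:
            break
    return [i in kept for i in range(len(weights))]

w = [0.5, 0.3, 0.15, 0.05]           # one head's normalized attention row
print(top_p_mask(w, 0.75))  # [True, True, False, False]
print(top_p_mask(w, 0.9))   # [True, True, True, False]
```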

[CV-108] Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

链接: https://arxiv.org/abs/2605.14487
作者: Jiahao Tian,Yiwei Wang,Gang Yu,Chi Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines. Project Page: this https URL.

[CV-109] Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection

链接: https://arxiv.org/abs/2605.14486
作者: Yiheng Li,Yang Yang,Zichang Tan,Gao Li,Zhen Lei,Wenhao Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: preprint

点击查看摘要

Abstract:As the misuse of AI-generated images grows, generalizable image detection techniques are urgently needed. Recent state-of-the-art (SOTA) methods adopt aligned training datasets to reduce content, size, and format biases, empowering models to capture robust forgery cues. A common strategy is to employ reconstruction techniques, e.g., VAE and DDIM, which show remarkable results in diffusion-based methods. However, such reconstruction-based approaches typically introduce limited and homogeneous artifacts, which cannot fully capture diverse generative patterns, such as GAN-based methods. To complement reconstruction-based fake images with aligned yet diverse artifact patterns, we propose a GAN-based upsampling approach that mimics GAN-generated fake patterns while preserving content, size, and format alignment. This naturally results in two aligned but distinct types of fake images. However, due to the domain shift between reconstruction-based and upsampling-based fake images, direct mixed training causes suboptimal results, where one domain disrupts feature learning of the other. Accordingly, we propose a Separate Expert Fusion (SEF) framework to extract complementary artifact information and reduce inter-domain interference. We first train domain-specific experts via LoRA adaptation on a frozen foundational model, then conduct decoupled fusion with a gating network to adaptively combine expert features while retaining their specialized knowledge. Rather than merely benefiting GAN-generated image detection, this design introduces diverse and complementary artifact patterns that enable SEF to learn a more robust decision boundary and improve generalization across broader generative methods. Extensive experiments demonstrate that our method yields strong results across 13 diverse benchmarks. Codes are released at: this https URL.

[CV-110] GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding

链接: https://arxiv.org/abs/2605.14475
作者: Jiashun Zhu,Ronghao Fu,Jiasen Hu,Nachuan Xing,Xu Na,Xiao Yang,Zhiwen Lin,Weipeng Zhang,Lang Sun,Zhiheng Xue,Haoran Liu,Weijie Zhang,Bo Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Interpreting ultra-high-resolution (UHR) remote sensing images requires models to search for sparse and tiny visual evidence across large-scale scenes. Existing remote sensing vision-language models can inspect local regions with zooming and cropping tools, but most exploration strategies follow either a one-shot focus or a single sequential trajectory. Such single-path exploration can lose global context, leave scattered regions unvisited, and revisit or count the same evidence multiple times. To this end, we propose GeoVista, a planning-driven active perception framework for UHR remote sensing interpretation. Instead of committing to one zooming path, GeoVista first builds a global exploration plan, then verifies multiple candidate regions through branch-wise local inspection, while maintaining an explicit evidence state for cross-region aggregation and de-duplication. To enable this behavior, we introduce APEX-GRO, a cold-start supervised trajectory corpus that reformulates diverse UHR tasks as Global-Region-Object interactive reasoning processes with a unified, scale-invariant spatial representation. We further design an Observe-Plan-Track mechanism for global observation, adaptive region inspection, and evidence tracking, and align the model with a GRPO-based strategy using step-wise rewards for planning, localization, and final answer correctness. Experiments on RSHR-Bench, XLRS-Bench, and LRS-VQA show that GeoVista achieves state-of-the-art performance. Code and dataset are available at this https URL
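
GeoVista's evidence state counts scattered evidence only once across revisited regions. A toy spatial de-duplication pass illustrates the idea (function name and the center-distance rule are assumptions):

```python
def deduplicate_evidence(detections, min_dist=1.0):
    """Cross-region evidence de-duplication sketch: drop any detection
    whose center falls within min_dist of one already kept, so the same
    object seen from overlapping zoomed regions is counted once."""
    kept = []
    for x, y in detections:
        if all((x - kx) ** 2 + (y - ky) ** 2 >= min_dist ** 2 for kx, ky in kept):
            kept.append((x, y))
    return kept

# Two overlapping region crops report the same object at nearly the same spot.
dets = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
print(deduplicate_evidence(dets))  # [(0.0, 0.0), (5.0, 5.0)]
```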

[CV-111] Real2Sim in HOI: Toward Physically Plausible HOI Reconstruction from Monocular Videos

链接: https://arxiv.org/abs/2605.14462
作者: Yubo Zhao,Yujin Chai,Yunao Dong,Chengfeng Zhao,Zijiao Zeng,Yuan Liu,Chi-Keung Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recovering 4D human-object interaction (HOI) from monocular video is a key step toward scalable 3D content creation, embodied AI, and simulation-based learning. Recent methods can reconstruct temporally coherent human and object trajectories, but these trajectories often remain merely visual reconstructions, failing to preserve stable contact, functional manipulation, or physical plausibility when used as reference motions for humanoid-object simulation. This reveals a fundamental interaction gap: HOI reconstruction should not stop at tracking a human and an object, but should recover the relation that makes their motion a coherent interaction. We introduce HA-HOI, a framework for reconstructing physically plausible 4D HOI animation from in-the-wild monocular videos. Instead of treating the human and object as independent entities in an ambiguous monocular 3D space, we propose a human-first, object-follow formulation. The human motion is recovered as the interaction anchor, and the object is reconstructed, aligned, and refined relative to the human action. The resulting kinematic trajectory is then projected into a physics-based humanoid-object simulation, where it acts as a teacher trajectory for stable physical rollout. Across benchmark and in-the-wild videos, HA-HOI improves human-object alignment, contact consistency, temporal stability, and simulation readiness over prior monocular HOI reconstruction methods. By moving beyond visually plausible trajectory recovery toward physically grounded interaction animation, our work takes a step toward turning general monocular HOI videos into scalable demonstrations for humanoid-object behavior. Project page: this https URL

[CV-112] ClickRemoval: An Interactive Open-Source Tool for Object Removal in Diffusion Models

链接: https://arxiv.org/abs/2605.14461
作者: Ledun Zhang,Yatu Ji,Xufei Zhuang,Xinying Yao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures. Open-source software paper

点击查看摘要

Abstract:Existing object removal tools often rely on manual masks or text prompts, making precise removal difficult for non-expert users in complex scenes and often leading to incomplete removal or unnatural background completion. To address this issue, we present ClickRemoval, an open-source interactive object removal tool built on pretrained Stable Diffusion models and driven solely by user clicks. Without additional training, hand-drawn masks, or text descriptions, ClickRemoval localizes target objects and restores the background through self-attention modulation during denoising. Experiments show that ClickRemoval achieves competitive results across quantitative metrics and user studies. We release a complete software package at this https URL under the Apache-2.0 license.

[CV-113] Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

链接: https://arxiv.org/abs/2605.14417
作者: Haozhe Jia,Honglei Jin,Yuan Zhang,Youcheng Fan,Shaofeng Liang,Lei Wang,Shuxu Jin,Kuimou Yu,Zinuo Zhang,Jianfei Song,Wenshuo Chen,Yutao Yue
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

[CV-114] GeoViSTA: Geospatial Vision-Tabular Transformer for Multimodal Environment Representation

链接: https://arxiv.org/abs/2605.14406
作者: Yuhao Liu,Sadeer Al-Kindi,Ashok Veeraraghavan,Guha Balakrishnan
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Large-scale pretraining on Earth observation imagery has yielded powerful representations of the natural and built environment. However, most existing geospatial foundation models do not directly model the structured socioeconomic covariates typically stored in tabular form. This modality gap limits their ability to capture the complete total environment, which is critical for reasoning about complex environmental, social, and health-related outcomes. In this work, we propose GeoViSTA (Geospatial Vision-Tabular Transformer), a vision-tabular architecture that learns unified geospatial embeddings from co-registered gridded imagery and tabular data. GeoViSTA utilizes bilateral cross-attention to exchange spatial and semantic information across modalities, guided by a geography-aware attention mechanism that aligns continuous image patches with irregular census-tract tokens. We train GeoViSTA with a self-supervised joint masked-autoencoding objective, forcing it to recover missing image patches and tabular rows using local spatial context and cross-modal cues. Empirically, GeoViSTA’s unified embeddings improve linear probing performance on high-impact downstream tasks, outperforming baselines in predicting disease-specific mortality and fire hazard frequency across held-out regions. These results demonstrate that jointly modeling the physical environment alongside structured socioeconomic context yields highly transferable representations for holistic geospatial inference.

[CV-115] DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making MICCAI2026

链接: https://arxiv.org/abs/2605.14403
作者: Yize Liu,Siyuan Yan,Ming Hu,Lie Ju,Xieji Li,Feilong Tang,Wei Feng,Zongyuan Ge
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI2026 early acceptance

点击查看摘要

Abstract:Dermatological diagnosis requires integrating fine-grained visual perception with expert clinical knowledge. Although Multimodal Large Language Models (MLLMs) facilitate interactive medical image analysis, their application in dermatology is hindered by insufficient domain-specific grounding and hallucinations. To address these issues, we propose DermAgent, a collaborative multi-tool agent that orchestrates seven specialized vision and language modules within a Plan-Execute-Reflect framework. DermAgent delivers stepwise, traceable diagnostic reasoning through three core components. First, it employs complementary visual perception tools for comprehensive morphological description, dermoscopic concept annotation, and disease diagnosis. Second, to overcome the lack of domain prior, a dual-modality retrieval module anchors every prediction in external evidence by cross-referencing 413,210 diagnosed image cases and 3,199 clinical guideline chunks. To further mitigate hallucinations, a deterministic critic module conducts strict post-hoc auditing via confidence, coverage, and conflict gates, automatically detecting inter-source disagreements to trigger targeted self-correction. Extensive experiments on five dermatology benchmarks demonstrate that DermAgent consistently outperforms state-of-the-art MLLMs and medical agent baselines across zero-shot fine-grained disease diagnosis, concept annotation, and clinical captioning tasks, exceeding GPT-4o by 17.6% in skin disease diagnostic accuracy and 3.15% in captioning ROUGE-L. Our code is available at this https URL.
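The critic module's three audit gates (confidence, coverage, conflict) can be illustrated with a toy deterministic check; the thresholds, record fields, and return convention here are hypothetical, not DermAgent's actual interface.

```python
def audit(prediction, sources, conf_thresh=0.6, cov_thresh=0.5):
    """Deterministic post-hoc audit with confidence, coverage, and conflict gates.
    `prediction` = {"label": str, "confidence": float}
    `sources` = list of {"label": str, "covers": bool} from retrieval/tools.
    Returns the list of failed gates; an empty list means the prediction passes."""
    failed = []
    if prediction["confidence"] < conf_thresh:
        failed.append("confidence")
    coverage = sum(s["covers"] for s in sources) / max(len(sources), 1)
    if coverage < cov_thresh:
        failed.append("coverage")
    if any(s["label"] != prediction["label"] for s in sources):
        failed.append("conflict")  # inter-source disagreement triggers self-correction
    return failed
```

For example, a confident diagnosis contradicted by one retrieved case would fail only the conflict gate, prompting targeted self-correction rather than a full re-run.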

[CV-116] SceneForge: Structured World Supervision from 3D Interventions

链接: https://arxiv.org/abs/2605.14399
作者: Jizhizi Li,Jiayang Ao,Danny Wicks,Petru-Daniel Tudosiu
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:

点击查看摘要

Abstract:Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.

[CV-117] Systematic Discovery of Semantic Attacks in Online Map Construction through Conditional Diffusion

链接: https://arxiv.org/abs/2605.14396
作者: Chenyi Wang,Ruoyu Song,Raymond Muller,Jean-Philippe Monteuuis,Jonathan Petit,Z. Berkay Celik,Ryan Gerdes,Ming F. Li
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Autonomous vehicles depend on online HD map construction to perceive lane boundaries, dividers, and pedestrian crossings – safety-critical road elements that directly govern motion planning. While existing pixel perturbation attacks can disrupt the mapping, they can be neutralized by standard adversarial defenses. We present MIRAGE, a framework for systematic discovery of semantic attacks that bypass adversarial defenses and degrade mapping predictions by finding plausible environmental variation (e.g. shadows, wet roads). MIRAGE exploits the latent manifold of real-world data learned by diffusion models, and searches for semantically mutated scenes neighboring the ground truth with the same road topology yet mislead the mapping predictions. We evaluate MIRAGE on nuScenes and demonstrate two attacks: (1) boundary removal, suppressing 57.7% of detections and corrupting 96% of planned trajectories; and (2) boundary injection, the only method that successfully injects fictitious boundaries, while pixel PGD and AdvPatch fail entirely. Both attacks remain potent under various adversarial defenses. We use two independent VLM judges to quantify realism, where MIRAGE passes as realistic 80–84% of the time (vs. 97–99% for clean nuScenes), while AdvPatch only 0–9%. Our findings expose a categorical gap in current adversarial defenses: semantic-level perturbations that manifest as legitimate environmental variation are substantially harder to mitigate than pixel-level perturbations.

[CV-118] Analogical Trajectory Transfer

链接: https://arxiv.org/abs/2605.14393
作者: Junho Kim,Eun Sun Lee,Gwangtak Bae,Seunggu Kang,Young Min Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We study analogical trajectory transfer, where the goal is to translate motion trajectories in one 3D environment to a semantically analogous location in another. Such a capacity would enable machines to perform analogical spatial reasoning, with applications in AR/VR co-presence, content creation, and robotics. However, even semantically similar scenes can still differ substantially in object placement, scale, and layout, so naively matching semantics leads to collisions or geometric distortions. Furthermore, finding where each trajectory point should transfer to has a large search space, as the mapping must preserve semantics and functionality without tearing the trajectory apart or causing collisions. Our key insight is to decompose the problem into spatially segregated subproblems and merge their solutions to produce semantically consistent and spatially coherent transfers. Specifically, we partition scenes into object-centric clusters and estimate cross-scene mappings via hierarchical smooth map prediction, using 3D foundation model features that encode contextual information from object and open-space arrangements. We then combinatorially assemble the per-cluster maps into an initial transfer and refine the result to remove collisions and distortions, yielding a spatially coherent trajectory. Our method does not require training, attains a fast runtime around 0.6 seconds, and outperforms baselines based on LLMs, VLMs, and scene graph matching. We further showcase applications in virtual co-presence, multi-trajectory transfer, camera transfer, and human-to-robot motion transfer, which indicates the broad applicability of our work to AR/VR and robotics.

[CV-119] Dual-Latent Collaborative Decoding for Fidelity-Perception Balanced Image Compression

链接: https://arxiv.org/abs/2605.14391
作者: Qi Mao,Zijian Wang,Zhengxue Cheng,Lingyu Zhu,Siwei Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Learned image compression (LIC) increasingly requires reconstructions that balance distortion fidelity and perceptual realism across a wide range of bitrates. However, most existing methods still rely on a single compressed latent representation to simultaneously carry structural details, semantic cues, and perceptual priors, requiring the same latent representation to serve multiple, potentially conflicting roles. This tension becomes evident across different latent paradigms: scalar-quantized (SQ) continuous latents provide rate-scalable fidelity but tend to lose perceptual details at low rates, while vector-quantized (VQ) discrete tokens preserve compact semantic cues but suffer from limited structural fidelity and bitrate scalability. To address this issue, we propose Mixture of Decoder Experts (MoDE), a dual-latent collaborative decoding framework that decomposes reconstruction responsibilities across complementary latent paradigms. Specifically, MoDE treats the SQ branch as a fidelity-oriented expert and the VQ branch as a perception-oriented expert, and coordinates them through two decoder-side modules: Expert-Specific Enhancement (ESE), which preserves branch-specific expert references, and Cross-Expert Modulation (CEM), which enables selective complementary transfer during reconstruction. The resulting framework supports selective cross-latent collaboration under a shared dual-stream bitstream and enables both fidelity-anchored and perception-anchored decoding. Extensive experiments demonstrate that MoDE achieves a more favorable fidelity-perception balance than representative distortion-oriented, perception-oriented, generative, and dual-latent baselines across a wide bitrate range, highlighting decoder-side expert collaboration as an effective design for wide-range fidelity-perception balanced LIC.

[CV-120] Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

链接: https://arxiv.org/abs/2605.14382
作者: Yuheng Wu,Xiangbo Gao,Tianhao Chen,Xinghao Chen,Qing Yin,Zhengzhong Tu,Dongman Lee
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.
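The trust-region idea can be sketched as a scalar weight derived from the latent delta between teacher and generator, blending teacher supervision with a continuity objective. The exponential weighting form and the trust radius below are illustrative assumptions, not the paper's actual estimator.

```python
import numpy as np

def delta_forcing_weight(teacher_latent, student_latent, radius=1.0):
    """Map the teacher-generator latent delta to a trust weight in (0, 1]:
    deltas inside the trust radius keep full teacher weight; larger deltas
    are smoothly downweighted (a hypothetical functional form)."""
    delta = np.linalg.norm(teacher_latent - student_latent)
    return float(np.exp(-max(delta - radius, 0.0)))

def combined_loss(teacher_loss, continuity_loss, w):
    # Balance teacher supervision against the monotonic continuity objective.
    return w * teacher_loss + (1.0 - w) * continuity_loss
```

When the teacher's guidance drifts far from the generator's trajectory, the weight shrinks and the continuity term dominates, which matches the paper's goal of suppressing unreliable teacher-induced shifts.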

[CV-121] Learning with Semantic Priors: Stabilizing Point-Supervised Infrared Small Target Detection via Hierarchical Knowledge Distillation

链接: https://arxiv.org/abs/2605.14346
作者: Yuanhang Yao,Ping Qian,Zhu Liu,Long Ma,Weimin Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Single-frame Infrared Small Target Detection (ISTD) aims to localize weak targets under heavy background clutter, yet dense pixel-wise annotations are expensive. Point supervision with online label evolution reduces annotation cost; however, lightweight CNN detectors often lack sufficient semantics, leading to noisy pseudo-masks and unstable optimization. To address this, we propose a hierarchical VFM-driven knowledge distillation framework that uses a frozen Vision Foundation Model (VFM) during training. We formulate point-supervised learning as a bilevel optimization process: the inner loop adapts a VFM-embedded teacher on reweighted training samples, while the outer loop transfers validation-guided knowledge to a lightweight student to mitigate pseudo-label noise and training-set bias. We further introduce Semantic-Conditioned Affine Modulation (SCAM) to inject VFM semantics into CNN features at multiple layers. In addition, a dynamic collaborative learning strategy with cluster-level sample reweighting enhances robustness to imperfect pseudo-masks. Experiments on diverse challenging cases across multiple ISTD backbones demonstrate consistent improvements in detection accuracy and training stability. Our code is available at this https URL.

[CV-122] AnyBand-Diff: A Unified Remote Sensing Image Generation and Band Repair Framework with Spectral Priors

链接: https://arxiv.org/abs/2605.14341
作者: Zuopeng Zhao,Ying Liu,Xiaoyu Li,Su Luo,Lu Li,Wenwen Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Existing diffusion models have made significant progress in generating realistic images. However, their direct adaptation to remote sensing imagery often disregards intrinsic physical laws. This oversight frequently leads to spectral distortion and radiometric inconsistency, severely limiting the scientific utility of generated data. To address this issue, this paper introduces AnyBand-Diff, a novel spectral-prior-guided diffusion framework tailored for robust spectral reconstruction. Specifically, we design a Masked Conditional Diffusion backbone integrated with a dual stochastic masking strategy, empowering the model to recover complete spectral information from arbitrary band subsets. Subsequently, to ensure radiometric fidelity, a Physics-Guided Sampling mechanism is proposed, leveraging gradients from a differentiable physical model to explicitly steer the denoising trajectory toward the manifold of physically plausible solutions. Furthermore, a Multi-Scale Physical Loss is formulated to enforce rigorous constraints across pixel, region, and global levels in a joint manner. Extensive experiments confirm the effectiveness of AnyBand-Diff in generating reliable imagery and achieving accurate spectral reconstruction, contributing to the advancement of physics-aware generative methods for Earth observation.
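The dual stochastic masking strategy can be illustrated on a toy (bands, H, W) cube: whole spectral bands are dropped at random, then random pixels are masked within the surviving bands. The two probabilities below are placeholders, not the paper's training settings.

```python
import numpy as np

def dual_band_mask(image, p_band=0.5, p_pixel=0.2, rng=None):
    """Dual stochastic masking over a (bands, H, W) cube: drop whole spectral
    bands with probability p_band, then mask random pixels in the survivors.
    The model is trained to recover the full cube from what remains."""
    if rng is None:
        rng = np.random.default_rng()
    bands, h, w = image.shape
    keep_band = rng.random(bands) >= p_band             # band-level mask
    pixel_mask = rng.random((bands, h, w)) >= p_pixel   # pixel-level mask
    mask = keep_band[:, None, None] & pixel_mask
    return image * mask, mask
```

Training against such masks is what lets the model reconstruct complete spectra from arbitrary band subsets at inference time.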

[CV-123] IG-Diff: Complex Night Scene Restoration with Illumination-Guided Diffusion Model

链接: https://arxiv.org/abs/2605.14337
作者: Yifan Chen,Fei Yin,Chunle Guo,Chongyi Li,Yujiu Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CGI-2025

点击查看摘要

Abstract:In nighttime circumstances, it is challenging for individuals and machines to perceive their surroundings. While prevailing image restoration methods adeptly handle singular forms of degradation, they falter when confronted with intricate nocturnal scenes, such as the concurrent presence of weather and low-light conditions. Compounding this challenge, the lack of paired data that encapsulates the coexistence of low-light situations and other forms of degradation hinders the development of a comprehensive end-to-end solution. In this work, we contribute complex nighttime scene datasets that simulate both illumination degradation and other forms of deterioration. To address the complexity of night degradation, we propose an integration of an illumination-guided module embedded in the diffusion model to guide the illumination restoration process. Our model can preserve texture fidelity while contending with the adversities posed by various forms of degradation in low-light scenarios.

[CV-124] InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

链接: https://arxiv.org/abs/2605.14333
作者: Yang Yue,Fangyun Wei,Tianyu He,Jinjing Zhao,Zanlin Ni,Zeyu Liu,Jiayi Guo,Lei Shi,Yue Dong,Li Chen,Ji Li,Gao Huang,Dong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Code and checkpoints are available at this https URL

点击查看摘要

Abstract:Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.

[CV-125] D2-CDIG: Controlled Diffusion Remote Sensing Image Generation with Dual Priors of DEM and Cloud-Fog

链接: https://arxiv.org/abs/2605.14326
作者: Zuopeng Zhao,Ying Liu,Kanyaphakphachsorn Pharksuwan,Su Luo,Xiaoyu Li,Maocai Ning
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Remote sensing image generation provides a reliable data foundation for remote sensing large models and downstream tasks. However, existing controllable remote sensing image generation methods typically rely on traditional techniques such as segmentation and edge detection, which do not fully leverage terrain or atmospheric conditions. As a result, the generated images often lack accuracy and naturalness when dealing with complex terrains and atmospheric phenomena. In this paper, we propose a novel remote sensing image generation framework, D2-CDIG, which integrates diffusion models with a dual-prior control mechanism. By incorporating both Digital Elevation Model (DEM) and cloud-fog information as dual prior knowledge, D2-CDIG precisely controls ground features and atmospheric phenomena within the generated images. Specifically, D2-CDIG decouples the terrain and atmospheric generation processes through independent control of ground and atmospheric branches. Additionally, a refined cloud-fog slider is introduced to flexibly adjust cloud thickness and distribution. During training, ground and atmospheric control signals are injected in layers to ensure a seamless transition within the images. Compared to traditional methods based on segmentation or edge detection, D2-CDIG shows significant improvements in image quality, detail richness, and realism. D2-CDIG offers a flexible and precise solution for remote sensing image generation, providing high-quality data for training large remote sensing models and downstream tasks.

[CV-126] TurboVGGT: Fast Visual Geometry Reconstruction with Adaptive Alternating Attention

链接: https://arxiv.org/abs/2605.14315
作者: David Huang,Guile Wu,Chengjie Huang,Bingbing Liu,Dongfeng Bai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Recent feed-forward 3D reconstruction methods, such as visual geometry transformers, have substantially advanced the traditional per-scene optimization paradigm by enabling effective multi-view reconstruction in a single forward pass. However, most existing methods struggle to achieve a balance between reconstruction quality and computational efficiency, which limits their scalability and efficiency. Although some efficient visual geometry transformers have recently emerged, they typically use the same sparsity ratio across layers and frames and lack mechanisms to adaptively learn representative tokens to capture global relationships, leading to suboptimal performance. In this work, we propose TurboVGGT, a novel approach that employs an efficient visual geometry transformer with adaptive alternating attention for fast multi-view 3D reconstruction. Specifically, TurboVGGT employs an end-to-end trainable framework with adaptive sparse global attention guided by adaptive sparsity selection to capture global relationships across frames and frame attention to aggregate local details within each frame. In the adaptive sparse global attention, TurboVGGT adaptively learns representative tokens with varying sparsity levels for global geometry modeling, considering that token importance varies across frames, attention layers operate tokens at different levels of abstraction, and global dependencies rely on structurally informative regions. Extensive experiments on multiple 3D reconstruction benchmarks demonstrate that TurboVGGT achieves fast multi-view reconstruction while maintaining competitive reconstruction quality compared with state-of-the-art methods. Project page: this https URL.
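The adaptive sparsity selection can be sketched as score-based top-k token retention, where the kept fraction varies per layer and frame. In TurboVGGT both the importance scores and the sparsity level are learned end to end; here both are passed in explicitly for illustration.

```python
import numpy as np

def select_representative_tokens(tokens, scores, sparsity):
    """Keep the top-`sparsity` fraction of tokens by importance score,
    preserving the original token order among the kept set."""
    k = max(1, int(round(sparsity * len(tokens))))
    idx = np.sort(np.argsort(scores)[-k:])  # indices of the k highest-scoring tokens
    return tokens[idx], idx
```

Sparse global attention would then operate only on the returned tokens, while dense frame attention still sees every token within each frame.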

[CV-127] CoRDS: Coreset-based Representative and Diverse Selection for Streaming Video Understanding

链接: https://arxiv.org/abs/2605.14310
作者: Ailar Mahdizadeh,Puria Azadi,Muchen Li,Xiangteng He,Leonid Sigal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Streaming video understanding with large vision-language models (VLMs) requires a compact memory that can support future reasoning over an ever-growing visual history. A common solution is to compress the key-value (KV) cache, but existing streaming methods typically rely on local token-wise heuristics, such as recency, temporal redundancy, or saliency, which do not explicitly optimize whether the retained cache is representative of the accumulated history. We propose to view KV-cache compression as a coreset selection problem: rather than scoring tokens independently for retention, we select a small subset that covers the geometry of the accumulated visual cache. Our method operates in a joint KV representation and introduces a bicriteria objective that balances coverage in key and value spaces, preserving both retrieval structure and output-relevant information. To encourage a more diverse retained subset, we further introduce an orthogonality-driven diversity criterion that favors candidates contributing new directions beyond the current selection, and connect this criterion to log-determinant subset selection. Across four open-source VLMs and five long-video and streaming-video benchmarks, our method improves over heuristic streaming compression baselines under a fixed cache budget. These results highlight that representative coreset selection offers a more effective principle than token-wise pruning for memory-constrained streaming video understanding.
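The orthogonality-driven diversity criterion can be illustrated with a greedy routine that repeatedly picks the candidate with the largest residual after projecting out the span of already-selected vectors; this is a standard proxy for log-determinant subset selection, not CoRDS's actual bicriteria objective.

```python
import numpy as np

def greedy_diverse_coreset(vectors, budget):
    """Greedy selection favoring candidates that add new directions beyond
    the current selection (largest residual norm after projecting onto the
    span of selected vectors)."""
    selected, basis = [], []
    for _ in range(budget):
        best, best_res = None, -1.0
        for i, v in enumerate(vectors):
            if i in selected:
                continue
            r = v.astype(float).copy()
            for b in basis:               # project out selected directions
                r -= (r @ b) * b
            if np.linalg.norm(r) > best_res:
                best, best_res = i, np.linalg.norm(r)
        selected.append(best)
        r = vectors[best].astype(float).copy()
        for b in basis:
            r -= (r @ b) * b
        basis.append(r / (np.linalg.norm(r) + 1e-12))
    return selected
```

A near-duplicate of an already-kept token contributes almost no residual, so the routine skips it in favor of a token pointing in a genuinely new direction.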

[CV-128] ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition

链接: https://arxiv.org/abs/2605.14309
作者: Shen Lin,Jing Lin,Junhao Dong,Piotr Koniusz,Li Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
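The decomposition interface, a visual feature expressed as a sparse nonnegative combination of concept vectors, can be sketched with projected gradient descent; the solver, step size, and the simple zero-out suppression below are illustrative stand-ins for the paper's concept-level optimization.

```python
import numpy as np

def nonnegative_decompose(x, concepts, steps=500, lr=0.01):
    """Decompose feature x into nonnegative weights over concept vectors by
    projected gradient descent on ||concepts.T @ a - x||^2."""
    C = concepts                              # (n_concepts, dim)
    a = np.zeros(C.shape[0])
    for _ in range(steps):
        grad = C @ (C.T @ a - x)              # gradient of the squared residual
        a = np.maximum(a - lr * grad, 0.0)    # project onto the nonnegative orthant
    return a

def suppress_concept(a, concepts, target):
    """Concept-level unlearning sketch: zero the target concept's weight and
    reconstruct, leaving non-target concept weights untouched."""
    keep = a.copy()
    keep[target] = 0.0
    return concepts.T @ keep
```

The explicit per-concept weights are what make the manipulation interpretable: forgetting a target concept touches one coefficient rather than the whole representation.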

[CV-129] KVPO: ODE-Native GRPO for Autoregressive Video Alignment via KV Semantic Exploration

链接: https://arxiv.org/abs/2605.14278
作者: Ruicheng Zhang,Kaixi Cong,Jun Zhou,Zhizhou Zhong,Zunnan Xu,Shuiyang Mao,Wei Liu,Xiu Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Aligning streaming autoregressive (AR) video generators with human preferences is challenging. Existing reinforcement learning methods predominantly rely on noise-based exploration and SDE-based surrogate policies that are mismatched to the deterministic ODE dynamics of distilled AR models, and tend to perturb low-level appearance rather than the high-level semantic storyline progression critical for long-horizon coherence. To address these limitations, we present KVPO, an ODE-native online Group Relative Policy Optimization (GRPO) framework for aligning streaming video generators. For diversity exploration, KVPO introduces a causal-semantic exploration paradigm that relocates the source of variation from stochastic noise to the historical KV cache. By stochastically routing historical KV entries, it constructs semantically diverse generation branches that remain strictly on the data manifold. For policy modeling, KVPO introduces a velocity-field surrogate policy based on Trajectory Velocity Energy (TVE), which quantifies branch likelihood in flow-matching velocity space and yields a reward-weighted contrastive objective fully consistent with the native ODE formulation. Experiments on multiple distilled AR video generators demonstrate consistent gains in visual quality, motion quality, and text-video alignment across both single-prompt short-video and multi-prompt long-video settings.

[CV-130] CreFlow: Corrective Reflow for Sparse-Reward Embodied Video Diffusion RL

链接: https://arxiv.org/abs/2605.14274
作者: Zhenyang Ni,Yijiang Li,Ruochen Jiao,Simon Sinong Zhan,Sipeng Chen,Zhenfei Yin,Minshuo Chen,Philip Torr,Zhaoran Wang,Qi Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Video generation models trained on heterogeneous data with likelihood-surrogate objectives can produce visually plausible rollouts that violate physical constraints in embodied manipulation. Although reinforcement-learning post-training offers a natural route to adapting VGMs, existing video-RL rewards often reduce each rollout to a low-level visual metric, whereas manipulation video evaluation requires logic-based verification of whether the rollout satisfies a compositional task specification. To fill this gap, we introduce a compositional constraint-based reward model for post-training embodied video generation models, which automatically formulates task requirements as a composition of Linear Temporal Logic constraints, providing faithful rewards and localized error information in generated videos. To achieve effective improvement in high-dimensional video generation using these reward signals, we further propose CreFlow, a novel online RL framework with two key designs: i) a credit-aware NFT loss that confines the RL update to reward-relevant regions, preventing perturbations to unrelated regions during post-training; and ii) a corrective reflow loss that leverages within-group positive samples as an explicit estimate of the correction direction, stabilizing and accelerating training. Experiments show that CreFlow yields reward judgments better aligned with human and simulator success labels than existing methods and improves downstream execution success by 23.8 percentage points across eight bimanual manipulation tasks.

[CV-131] Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers ICML2026

链接: https://arxiv.org/abs/2605.14270
作者: Kanghyun Baek,Jaihyun Lew,Chaehun Shin,Jungbeom Lee,Sungroh Yoon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026

点击查看摘要

Abstract:Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-to-image generation, yet they frequently suffer from concept omission, where specified objects or attributes fail to emerge in the generated image. By performing linear probing on text tokens, we demonstrate that text embeddings can distinguish a characteristic 'omission signal' representing the absence of target concepts. Leveraging this insight, we propose Omission Signal Intervention (OSI), which amplifies the omission signal to actively catalyze the generation of missing concepts. Comprehensive experiments on FLUX.1-Dev and SD3.5-Medium demonstrate that OSI significantly alleviates concept omission even in extreme scenarios.
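The intervention itself reduces to amplifying each token embedding's component along a probe-derived direction. In the paper that direction comes from linear probing; here it is supplied directly, and the amplification form is a minimal sketch.

```python
import numpy as np

def amplify_omission_signal(text_embeds, probe_direction, alpha=2.0):
    """Project each token embedding onto the 'omission' direction found by a
    linear probe and amplify that component by alpha (alpha=1 is a no-op)."""
    d = probe_direction / np.linalg.norm(probe_direction)
    coeff = text_embeds @ d                    # per-token omission component
    return text_embeds + (alpha - 1.0) * coeff[:, None] * d[None, :]
```

Because only the component along the probe direction is scaled, the rest of the embedding, and hence unrelated prompt semantics, is left unchanged.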

[CV-132] PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

链接: https://arxiv.org/abs/2605.14269
作者: Yidong Huang,Zun Wang,Han Lin,Dong-Ki Kim,Shayegan Omidshafiei,Jaehong Yoon,Jaemin Cho,Yue Zhang,Mohit Bansal
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: First two authors contributed equally, website: this https URL

点击查看摘要

Abstract:Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.
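The three reward axes can be illustrated with toy per-axis checks on a recovered trajectory; the thresholds and scoring forms below are simple stand-ins for the paper's MuJoCo-based evaluation, chosen only to show how the axes give separate, interpretable signals.

```python
import numpy as np

def motion_physics_scores(foot_heights, joint_vels, contact_flags, vel_limit=10.0):
    """Toy scores for the three axes: kinematic plausibility, contact/balance
    consistency, and dynamic feasibility. Each is a continuous value in (0, 1]."""
    # Kinematic: fraction of joint velocities within a plausible limit.
    kinematic = float(np.mean(np.abs(joint_vels) <= vel_limit))
    # Contact: frames flagged as in contact should have the foot near the ground.
    near_ground = foot_heights <= 0.05
    contact = float(np.mean(near_ground[contact_flags])) if contact_flags.any() else 1.0
    # Dynamic: penalize large frame-to-frame velocity changes (accelerations).
    accel = np.abs(np.diff(joint_vels, axis=0))
    dynamic = float(np.exp(-accel.mean()))
    return {"kinematic": kinematic, "contact": contact, "dynamic": dynamic}
```

A floating body scores low on the contact axis while possibly scoring high kinematically, which is exactly the failure mode 2D perceptual rewards miss.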

[CV-133] Image Restoration via Diffusion Models with Dynamic Resolution ICML2026

链接: https://arxiv.org/abs/2605.14267
作者: Yang Zheng,Wen Li,Zhaoqiang Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026

点击查看摘要

Abstract:Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at this https URL.

[CV-134] Architecture-Aware Explanation Auditing for Industrial Visual Inspection

链接: https://arxiv.org/abs/2605.14255
作者: Sibo Jia,Zihang Zhao,Kunrong Li
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Industrial visual inspection systems increasingly rely on deep classifiers whose heatmap explanations may appear visually plausible while failing to identify the image regions that actually drive model decisions. This paper operationalizes an architecture-aware explanation audit protocol grounded in the native-readout hypothesis: the perturbation-based faithfulness of an explanation method is bounded by its structural distance from the model’s native decision mechanism. On WM-811K wafer maps (9 classes, 172k images) under a three-seed zero-fill perturbation protocol, ViT-Tiny + Attention Rollout attains Deletion AUC 0.211 against 0.432-0.525 for Swin-Tiny / ResNet18+CBAM / DenseNet121 + Grad-CAM (abs(Cohen’s d) > 1.1), despite lower classification accuracy. Swin-Tiny disentangles architecture family from readout structure: despite being a Transformer, its spatial feature-map hierarchy makes it Grad-CAM compatible, showing that the operative factor is readout structure rather than architecture family. A model-agnostic control (RISE) compresses all families to Deletion AUC about 0.1, indicating the gap arises from the explainer pathway; notably, RISE outperforms all native methods, so native readout is a compatibility principle rather than an optimality guarantee. A blur-fill sensitivity analysis shows that the family ordering reverses under a different perturbation baseline, reinforcing that faithfulness rankings are joint properties of (model, explainer, perturbation operator) triples. An exploratory boundary-condition study on MVTec AD (pretrained models) indicates that audit results are dataset/task dependent and identifies conditions requiring qualification. The protocol yields actionable guidance: explanation pathways should be co-designed with model architectures based on readout structure, and deployed heatmaps should be accompanied by quantitative faithfulness metrics.
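The Deletion AUC metric at the center of this audit protocol can be stated concretely. Below is a minimal numpy sketch, assuming a zero-fill baseline and a fixed chunked deletion schedule (the `score_fn` callable stands in for the model's class score; the chunking is an illustrative assumption, not the paper's exact protocol): pixels are removed in order of decreasing explanation relevance, and the model's score curve is averaged. A faithful explanation drives the score down sooner, yielding a lower value.

```python
import numpy as np

def deletion_auc(image, relevance, score_fn, steps=10):
    """Perturbation-based faithfulness: zero-fill pixels in order of
    decreasing explanation relevance and average the model's score along
    the way. Lower values mean the explanation located the decisive
    pixels sooner (i.e., it is more faithful)."""
    order = np.argsort(-relevance.ravel())        # most relevant first
    perturbed = image.ravel().copy()
    scores = [score_fn(perturbed.reshape(image.shape))]
    chunk = max(1, order.size // steps)
    for start in range(0, order.size, chunk):
        perturbed[order[start:start + chunk]] = 0.0   # zero-fill baseline
        scores.append(score_fn(perturbed.reshape(image.shape)))
    return float(np.mean(scores))                 # normalized area under the curve
```

The same scorer can be reused for a blur-fill baseline by changing the fill value, which is exactly the kind of perturbation-operator swap the paper shows can reverse faithfulness rankings.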

[CV-135] Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

链接: https://arxiv.org/abs/2605.14253
作者: Harry Robertshaw,Yanghe Hao,Weiyuan Deng,Benjamin Jackson,S.M.Hadi Sadati,Nikola Fischer,Tom Vercauteren,Alejandro Granados,Thomas C. Booth
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Harry Robertshaw and Yanghe Hao contributed equally to this work. Published in the International Journal of Computer Assisted Radiology and Surgery

点击查看摘要

Abstract:Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.

[CV-136] Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

链接: https://arxiv.org/abs/2605.14251
作者: Aarushi Kulkarni,Alarice Lowe,Pratik Shah
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (H&E) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women’s Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for H&E-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. H&E restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational H&E staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.
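Two of the fidelity metrics reported here (PCC and PSNR; SSIM is omitted for brevity) reduce to a few lines of numpy using their standard definitions; the `data_range` default assumes images scaled to [0, 1]:

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two images."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def psnr(a, b, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to `data_range`."""
    mse = float(np.mean((a - b) ** 2))
    return float("inf") if mse == 0.0 else 10.0 * np.log10(data_range ** 2 / mse)
```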

[CV-137] Implicit spatial-frequency fusion of hyperspectral and lidar data via kolmogorov-arnold networks

链接: https://arxiv.org/abs/2605.14239
作者: Zekun Long,Judy X. Yang,Jing Wang,Ali Zia,Guanyiman Fu,Jun Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 1 figure, conference

点击查看摘要

Abstract:Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen’s Kappa, while maintaining an efficient architecture.

[CV-138] Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

链接: https://arxiv.org/abs/2605.14221
作者: Ahmed Rekik,R. Jarrett Rushmore,Sylvain Bouix,Linda Marrakchi-Kacem
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 5 figures. Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

点击查看摘要

Abstract:Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard–Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.

[CV-139] MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving NEURIPS2026

链接: https://arxiv.org/abs/2605.14201
作者: Rajeev Yasarla,Deepti Hegde,Hsin-Pai Cheng,Shizhong Han,Yunxiao Shi,Meysam Sadeghigooghari,Hanno Ackermann,Litian Liu,Pranav Desai,Fatih Porikli,Mohammad Ghavamzadeh,Hong Cai
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 9 figures, NeurIPS 2026 submission

点击查看摘要

Abstract:Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

[CV-140] CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers CVPR

链接: https://arxiv.org/abs/2605.14191
作者: Zhuojin Li,Hsin-Pai Cheng,Hong Cai,Shizhong Han,Fatih Porikli
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 8 figures, CVPR workshop

点击查看摘要

Abstract:Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory headroom, enabling higher-resolution generation.
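The core idea of coherence-guided pruning can be illustrated with a toy sketch. The scoring rule below, cosine similarity of each token to its right and down lattice neighbours, is a hypothetical linear-time proxy for local redundancy; CoReDiT's exact score and reconstruction step are not reproduced here:

```python
import numpy as np

def coherence_prune(tokens, grid_hw, keep_ratio=0.5):
    """Score each token by cosine similarity to its right/down lattice
    neighbours (a linear-time proxy for local redundancy), then keep the
    least coherent tokens and drop the most redundant ones. This scoring
    rule is an illustrative assumption, not CoReDiT's formulation."""
    h, w = grid_hw
    t = tokens.reshape(h, w, -1)

    def cos(a, b):
        num = (a * b).sum(-1)
        den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8
        return num / den

    right = cos(t[:, :-1], t[:, 1:])   # similarity to right neighbour
    down = cos(t[:-1, :], t[1:, :])    # similarity to lower neighbour
    score = np.zeros((h, w))
    score[:, :-1] += right; score[:, 1:] += right
    score[:-1, :] += down;  score[1:, :] += down
    n_keep = max(1, int(keep_ratio * h * w))
    keep = np.argsort(score.ravel())[:n_keep]   # lowest coherence survives
    return np.sort(keep)
```

In a real DiT the skipped tokens would then be reconstructed from their retained neighbours rather than discarded outright, which is what preserves a dense representation.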

[CV-141] You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

链接: https://arxiv.org/abs/2605.14166
作者: Riccardo Carraro,Anna Briotto,Endi Hysa,Marco Fiorucci,Lamberto Ballan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct 128×128 facial images from severely degraded 16×16 inputs, achieving an 8× magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
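The heatmap-guided loss described above amounts to reweighting a per-pixel reconstruction error by detector-derived spatial weights. A minimal numpy sketch, in which the uniform/heatmap blending via `alpha` is an assumption for illustration rather than the paper's exact weighting:

```python
import numpy as np

def heatmap_weighted_l1(pred, target, heatmap, alpha=0.5):
    """Heatmap-guided reconstruction loss: per-pixel L1 error reweighted
    so errors in detector-highlighted regions (eyes, nose, mouth) cost
    more. The uniform/heatmap blend via `alpha` is a hypothetical choice
    that keeps a nonzero weight on every pixel."""
    w = heatmap / (heatmap.sum() + 1e-8)            # heatmap as a distribution
    w = alpha / heatmap.size + (1.0 - alpha) * w    # blend in a uniform floor
    return float((w * np.abs(pred - target)).sum())
```

Because the weights come straight from detector outputs, no auxiliary landmark network needs to be trained, which is the efficiency argument the abstract makes.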

[CV-142] Rethinking the Good Enough Embedding for Easy Few-Shot Learning

链接: https://arxiv.org/abs/2605.14145
作者: Michael Karnes,Alper Yilmaz
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The field of deep visual recognition is undergoing a paradigm shift toward universal representations. The Platonic Representation Hypothesis suggests that diverse architectures trained on massive datasets are converging toward a shared, “ideal” latent space. This again raises a critical question: is a “Good Embedding All You Need?” In this paper, we leverage this convergence to demonstrate that off-the-shelf embeddings are inherently “good enough” for complex tasks, rendering intensive task-specific fine-tuning unnecessary. We explore this hypothesis within the few-shot learning framework, proposing a straightforward, non-parametric pipeline that entirely bypasses backpropagation. By utilizing a k-Nearest Neighbor classifier on frozen DINOv2-L features, we conduct a layer-wise characterization to identify an optimal feature extraction. We further demonstrate that manifold refinement via PCA and ICA provides a beneficial regularizing effect. Our results across four major benchmarks demonstrate that our approach consistently surpasses sophisticated meta-learning algorithms, achieving state-of-the-art performance.
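The backpropagation-free pipeline described above (frozen embeddings, PCA-based manifold refinement, then a nearest-neighbour classifier) can be sketched in numpy. The toy vectors below stand in for frozen DINOv2-L features, and the ICA refinement step from the paper is omitted:

```python
import numpy as np

def pca_project(feats, k):
    """Manifold refinement: project features onto the top-k principal
    components (the paper's ICA step is omitted in this sketch)."""
    centered = feats - feats.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def knn_classify(support, support_labels, query, k=1):
    """Cosine k-NN over frozen embeddings: no backpropagation anywhere."""
    s = support / np.linalg.norm(support, axis=1, keepdims=True)
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    nearest = np.argsort(-(q @ s.T), axis=1)[:, :k]
    votes = support_labels[nearest]
    return np.array([np.bincount(v).argmax() for v in votes])
```

In an actual few-shot episode, `support` would hold the embeddings of the k labeled shots per class and `query` the embeddings of the test images, all produced by a single frozen forward pass.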

[CV-143] TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion CVPR'26

链接: https://arxiv.org/abs/2605.14136
作者: Nurislam Tursynbek,Zhiqiang Lao,Heather Yu,Gedas Bertasius,Marc Niethammer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR’26 Workshop on Agentic AI for Visual Media

点击查看摘要

Abstract:Recent text-to-video diffusion transformers generate visually compelling frames, yet still struggle with temporal coherence, often producing flickering, drifting, or unstable motion. We show that these failures leave a clear imprint inside the model: incoherent videos consistently exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, whereas stable motion corresponds to smooth, band-diagonal patterns. Building on this observation, we introduce TeDiO, a training-free, inference-time method that reinforces temporal consistency by regularizing these internal attention patterns. TeDiO estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates that promote coherent frame-to-frame dynamics, without modifying model weights or using external motion supervision. Across multiple video diffusion models (e.g., Wan2.1, CogVideoX), TeDiO delivers markedly smoother motion while preserving per-frame visual quality, offering an efficient plug-and-play approach to improving dynamic realism in modern video generation systems.

[CV-144] PanoPlane: Plane-Aware Panoramic Completion for Sparse-View Indoor 3D Gaussian Splatting

链接: https://arxiv.org/abs/2605.14135
作者: Adil Qureshi,Dongki Jung,Jaehoon Choi,Dinesh Manocha
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present PanoPlane, an approach for high-fidelity sparse-view indoor novel view synthesis that reconstructs closed room geometry via panoramic scene completion. Unlike perspective-based methods that generate training views from limited fields of view, PanoPlane leverages 360° panoramic completion to condition the generative process on the full spatial layout. We propose Layout Anchored Attention Steering, a training-free mechanism that steers attention within the diffusion model’s internal representation toward the scene’s detected planar surfaces at inference time. By directing each unobserved region’s attention toward geometrically consistent observed content, our method replaces unconstrained hallucination with grounded surface extrapolation. The resulting panoramic completions provide supervision for 3D Gaussian Splatting, enabling accurate novel-view synthesis across unobserved regions from as few as three input views. Experiments on Replica, ScanNet++, and Matterport3D demonstrate state-of-the-art novel view synthesis quality across 3, 6, and 9 input views, achieving up to +17.8% improvement in PSNR over the current state-of-the-art baseline without any training or fine-tuning of the diffusion model.

[CV-145] SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection CVPR2026

链接: https://arxiv.org/abs/2605.14110
作者: Sandro Papais,Lezhou Feng,Charles Cossette,Lingting Ge
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to CVPR 2026

点击查看摘要

Abstract:Vision Transformers (ViTs) enable strong multi-view 3D detection but are limited by high inference latency from dense token and query processing across multiple views and large 3D regions. Existing sparsity methods, designed mainly for 2D vision, prune or merge image tokens but do not extend to full-model sparsity or address 3D object queries. We introduce SToRe3D, a relevance-aligned sparsity framework that jointly selects 2D image tokens and 3D object queries while storing filtered features for reactivation. Mutual 2D-3D relevance heads allocate compute to driving-critical content and preserve other embeddings. Evaluated on nuScenes and our new nuScenes-Relevance benchmark, SToRe3D achieves up to 3x faster inference with marginal accuracy loss, establishing real-time large-scale ViT-based 3D detection while maintaining accuracy on planning-critical agents.

[CV-146] Bridging the Rural Healthcare Gap: A Cascaded Edge-Cloud Architecture for Automated Retinal Screening

链接: https://arxiv.org/abs/2605.14108
作者: Nishi Doshi,Shrey Shah
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Diabetic Retinopathy (DR) is one of the leading causes of preventable blindness, yet rural regions often lack the specialists and infrastructure needed for early detection. Although cloud-based deep learning systems offer high accuracy, they face significant challenges in these settings due to high latency, limited bandwidth, and high data transmission costs. To address these challenges, we propose a two-tier edge-cloud cascade on the public APTOS 2019 Blindness Detection dataset. Tier 1 runs a lightweight MobileNetV3-small model on a local clinic device to perform a binary triage between Referable DR (Classes 2-4) and Non-referable DR (Classes 0-1). Tier 2 runs a RETFound-DINOv2 model in the cloud for ordinal severity grading, but only on the subset of images flagged as referable by Tier 1. On a stratified APTOS test split of 733 images, Tier 1 reaches 98.99% sensitivity and 84.37% specificity at a validation-tuned high-sensitivity threshold. The default cascade forwards 49.52% of test images to Tier 2, reducing cloud calls by 50.48% relative to using a cloud-based model for all images. In the deployed 4-class output space (Class 0-1 / Class 2 / Class 3 / Class 4), the cascade obtains 80.49% accuracy and 0.8167 quadratic weighted kappa; the cloud-only baseline obtains 80.76% accuracy and 0.8184 quadratic weighted kappa. On APTOS, the cascade cuts cloud use by about half with a modest drop in grading performance. Index Terms: Diabetic Retinopathy, Edge-Cloud Cascade, MobileNetV3-small, RETFound-DINOv2, Retinal Screening, tele-ophthalmology
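The triage logic of this cascade reduces to a single threshold test on the Tier-1 referable-DR probability. A minimal sketch, where the two model callables and the threshold value are placeholders (the real threshold is tuned on a validation set for high sensitivity):

```python
def cascade_screen(image, tier1_prob_fn, tier2_grade_fn, threshold=0.2):
    """Two-tier edge-cloud triage: a lightweight on-device model screens
    for referable DR, and only flagged cases reach the cloud grader.
    `tier1_prob_fn` returns P(referable DR) for an image; both model
    callables and the threshold here are hypothetical placeholders."""
    if tier1_prob_fn(image) < threshold:
        return {"grade": "non-referable (0-1)", "cloud_used": False}
    return {"grade": tier2_grade_fn(image), "cloud_used": True}
```

Lowering the threshold trades specificity for sensitivity but forwards more images to the cloud, which is exactly the cost/accuracy trade-off quantified in the abstract.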

[CV-147] DUET: Dual-Paradigm Adaptive Expert Triage with Single-cell Inductive Prior for Spatial Transcriptomics Prediction

链接: https://arxiv.org/abs/2605.14104
作者: Junchao Zhu,Ruining Deng,Junlin Guo,Tianyuan Yao,Chongyu Qu,Juming Xiong,Zhengyi Lu,Yanfan Zhu,Marilyn Lionts,Yuechen Yang,Yu Wang,Shilin Zhao,Haichun Yang,Yuankai Huo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Inferring spatially resolved gene expression from histology images offers a cost-effective complement to spatial transcriptomics (ST). However, existing methods reduce this task to a simple morphology-to-expression mapping, where visual similarity does not guarantee molecular consistency. Meanwhile, single-cell data has amassed rich resources far surpassing the scale of ST data, yet it remains underexplored in vision-omics modeling. Furthermore, current approaches commit to a monolithic paradigm with bottlenecks, unable to balance expressive flexibility with biological fidelity. To bridge these gaps, we propose DUET, a novel dual-paradigm framework that synergizes parametric prediction and memory-based retrieval under cellular inductive priors. DUET implements a parallel regression-retrieval paradigm, adaptively reconciling the outputs of its complementary pathways. To mitigate aleatoric vision ambiguity, we incorporate large-scale single-cell references to impose molecular states as biological constraints for faithful learning. Building upon structural refinement, we further design a lightweight adapter to dynamically assign branch preference across spatial contexts to achieve optimal performance. Extensive experiments on three public datasets across varied gene scales demonstrate that DUET achieves SOTA performance, with consistent gains contributed by each proposed component. Code is available at this https URL

[CV-148] Venus-DeFakerOne: Unified Fake Image Detection & Localization

链接: https://arxiv.org/abs/2605.14091
作者: GuangJian Team
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifacts transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifacts transfer-interference patterns, the necessity of fine-grained supervision, and the original resolution artifacts preservation, highlighting the design principles for scalable, robust, and unified FIDL.

[CV-149] CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

链接: https://arxiv.org/abs/2605.14068
作者: Amirreza Mohseni,Mona Mohammadi,Morteza Saghafian,Naser Talebizadeh Saradari
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce CurveBench, a benchmark for hierarchical topological reasoning from visual input. CurveBench consists of 756 images of pairwise non-intersecting Jordan curves across easy, polygonal, topographic-inspired, maze-like, and dense counting configurations. Each image is annotated with a rooted tree encoding the containment relations between planar regions. We formulate the task as structured prediction: given an image, a model must recover the full rooted containment tree induced by the curves. Despite the visual simplicity of the task, the strongest evaluated model, Gemini 3.1 Pro, achieves only 71.1% tree-generation accuracy on CurveBench-Easy and 19.1% on CurveBench-Hard. We further demonstrate benchmark utility through RLVR-style fine-tuning of open-weight vision-language models. Our trained Qwen3-VL-8B model improves over Qwen-3-VL-8B-Thinking from 2.8% to 33.3% tree-generation accuracy on CurveBench-Easy, exceeding GPT-5.4 and Claude Opus 4.5 under our evaluation protocol. The remaining gap, especially on CurveBench-Hard, shows that exact topology-aware visual reasoning remains far from solved.
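The containment-tree target can be made concrete in a simplified setting where the non-intersecting Jordan curves are circles; the circle representation is an assumption for illustration, not the benchmark's input format:

```python
def containment_tree(circles):
    """Recover the rooted containment forest over non-intersecting
    circles, a simplified stand-in for the benchmark's Jordan curves.
    circles: list of (cx, cy, r). Because the curves never intersect,
    circle i lies inside circle j iff dist(centers) + r_i < r_j.
    Returns parent[i] = index of the smallest enclosing circle, or -1
    for roots."""
    parent = []
    for i, (xi, yi, ri) in enumerate(circles):
        best, best_r = -1, float("inf")
        for j, (xj, yj, rj) in enumerate(circles):
            if i == j:
                continue
            dist = ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
            if dist + ri < rj < best_r:   # j encloses i, tighter than current best
                best, best_r = j, rj
        parent.append(best)
    return parent
```

The benchmark's structured-prediction task asks a model to recover exactly this parent structure, but from pixels rather than from geometric primitives.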

[CV-150] Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning ICML2026

链接: https://arxiv.org/abs/2605.14054
作者: Haozhe Wang,Qixin Xu,Changpeng Wang,Taofeng Xue,Chong Peng,Wenhu Chen,Fangzhen Lin
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026 as Spotlight

点击查看摘要

Abstract:Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or complicated by the significant compute and engineering burden of external agentic complexity. Worse, this heavy investment does not yield proportional gains, often witnessing a “seesaw effect” on perception and reasoning. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception (“bad seeing”) or flawed logic (“bad thinking”)? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding the perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a “blindfolded reasoning” proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error – either bad seeing or bad thinking – enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.

[CV-151] Evolving Layer-Specific Scalar Functions for Hardware-Aware Transformer Adaptation

链接: https://arxiv.org/abs/2605.14047
作者: Kieran Carrigg,Sigur de Vries,Amirhossein Sadough,Marcel van Gerven
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
备注: 18 pages, 7 figures

点击查看摘要

Abstract:Vision Transformers (ViTs) achieve state-of-the-art performance on challenging vision tasks, but their deployment on edge devices is severely hindered by the computational complexity and global reduction bottleneck imposed by layer normalization. Recent methods attempt to bypass this by replacing normalization layers with hardware-friendly scalar approximations. However, these homogeneous replacements do not optimally fit all layers’ behaviour and rely on expensive model retraining. In this work, we propose a highly efficient, hardware-aware framework that utilizes genetic programming (GP) to evolve heterogeneous, layer-specific scalar functions directly from pre-trained weights. Coupled with a novel post-training re-alignment strategy, our approach eliminates the need to retrain models from scratch entirely. Our evolved expressions accurately approximate the target normalization behaviours, capturing 91.6% of the variance (R²) compared to only 70.2% for homogeneous baselines, allowing our modified architecture to recover 84.25% Top-1 ImageNet-1K accuracy in only 20 epochs. By preserving this performance while eliminating the global reduction bottleneck, our approach establishes a highly favourable trade-off between arithmetic complexity and off-chip memory traffic, removing a primary barrier to the efficient deployment of ViTs on edge accelerators.

[CV-152] PVRF: All-in-one Adverse Weather Removal via Prior-modulated and Velocity-constrained Rectified Flow

链接: https://arxiv.org/abs/2605.14045
作者: Wei Dong,Han Zhou,Terry Ji,Guanhua Zhao,Shahab Asoodeh,Yulun Zhang,Guangtao Zhai,Jun Chen,Xiaohong Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 9 figures, and 4 tables

点击查看摘要

Abstract:Adverse weather removal (AWR) in real-world images remains challenging due to heterogeneous and unseen degradations, while distortion-driven training often yields overly smooth results. We propose PVRF, a unified framework that integrates zero-shot soft weather perceptions with velocity-constrained rectified-flow refinement. PVRF introduces an AWR-specific question answering module (AWR-QA) that uses frozen vision–language models (VLMs) to estimate soft probabilities of weather types and low-level attribute scores. These perceptions condition restoration networks via attribute-modulated normalization (AMN) and weather-weighted adapters (WWA), producing an anchor estimate for refinement. We then learn a terminal-consistent residual rectified flow with perception-adaptive source perturbation and a terminal-consistent velocity parameterization to stabilize learning near the terminal regime. Extensive experiments show that PVRF improves both fidelity and perceptual quality over state-of-the-art baselines, with strong cross-dataset generalization on single and combined degradations. Code will be released at this https URL.

[CV-153] Masked Autoencoders with Limited Data: Does It Work? A Fine-Grained Bioacoustics Case Study CVPR2026

链接: https://arxiv.org/abs/2605.14031
作者: Wuao Liu,Mustafa Chasmai,Subhransu Maji,Grant Van Horn
类目: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Workshop on Fine-Grained Visual Categorization (FGVC) at CVPR 2026. 8 pages, 6 figures

点击查看摘要

Abstract:Bioacoustic recognition requires fine-grained acoustic understanding to distinguish similar-sounding species. However, many large-scale data repositories such as iNaturalist are weakly annotated, often with only a single positive species label per recording, making supervised learning particularly challenging. Inspired by advances in computer vision, recent approaches have shifted toward self-supervised learning to capture the underlying structure of audio without relying on exhaustive annotations. In particular, masked autoencoders (MAE) have shown strong transferability on massive audio corpora, yet their effectiveness in more modest bioacoustic settings remains underexplored. In this work, we conduct a systematic study of MAE pretraining for species classification on iNatSounds, analyzing the impacts of pretraining data scale, domain specificity, data curation, and transfer strategies. Consistent with prior work, we find that models pretrained on diverse general audio data achieve the best transfer performance on iNatSounds. Contrary to observations from large-scale audio benchmarks, we find that (1) additional masked reconstruction pretraining on domain-specific data provides limited benefits and may even degrade performance relative to off-the-shelf models, and (2) selective data filtering offers a negligible advantage when the overall data scale is limited. Our results indicate that, in moderate-sized fine-grained bioacoustic settings, pretraining scale dominates objective design. These findings further clarify when MAE-based pretraining is effective and provide practical guidance for model selection under limited supervision.

[CV-154] Unified Pix Token And Word Token Generative Language Model

链接: https://arxiv.org/abs/2605.14028
作者: Haun Leung,ZiNan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 6 figures

点击查看摘要

Abstract:Since the emergence of the Vision Transformer (ViT), it has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT obtained with the CLIP or SigLIP method serves as the vision encoder backbone that gives them visual understanding capabilities. However, this approach limits fine-grained visual understanding, for example making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens within a generative language model. The new model also features a dedicated token embedding for each pix of an image, color folding, global conditional attention approximation, and image unsupervised pretraining. We conducted image unsupervised pretraining experiments with the new model to explore its potential. The experimental results show that it performs well even with a small model and limited training data. We believe the model also conforms to the scaling law: as model parameters and training data increase, its performance will continue to improve.

[CV-155] CineMesh4D: Personalized 4D Whole Heart Reconstruction from Sparse Cine MRI

链接: https://arxiv.org/abs/2605.13994
作者: Xiaoyue Liu,Xiaohan Yuan,Mark Y Chan,Ching-Hui Sia,Lei Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate 3D+t whole-heart mesh reconstruction from cine MRI is a clinically crucial yet technically challenging task. The difficulty of this task arises from two coupled factors: inherently sparse sampling of 3D cardiac anatomy by 2D image slices and the tight coupling between cardiac shape and motion. Current cardiac image-to-mesh approaches typically reconstruct only a subset of cardiac chambers or a single phase of the cardiac cycle. In this work, we propose CineMesh4D, a novel end-to-end 4D (3D+t) pipeline that directly reconstructs patient-specific whole-heart mesh from multi-view 2D cine MRI via cross-domain mapping. Specifically, we introduce a differentiable rendering loss that enables supervision of 3D+t whole-heart mesh from multi-view sparse contours of cine MRI. Furthermore, we develop a dual-context temporal block that fuses global and local cardiac temporal information to capture high-dimensional sequential patterns. In quantitative and qualitative evaluations, CineMesh4D outperforms existing approaches in terms of reconstruction quality and motion consistency, providing a practical pathway for personalized real-time cardiac assessment. The code will be publicly released once the manuscript is accepted.

[CV-156] Few Channels Draw The Whole Picture: Revealing Massive Activations in Diffusion Transformers

链接: https://arxiv.org/abs/2605.13974
作者: Evelyn Turri,Davide Bucciarelli,Sara Sarto,Lorenzo Baraldi,Marcella Cornia
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Project page: this https URL

点击查看摘要

Abstract:Diffusion Transformers (DiTs) and related flow-based architectures are now among the strongest text-to-image generators, yet the internal mechanisms through which prompts shape image semantics remain poorly understood. In this work, we study massive activations: a small subset of hidden-state channels whose responses are consistently much larger than the rest. We show that, despite their sparsity, these few channels effectively draw the whole picture, in three complementary senses. First, they are functionally critical: a controlled disruption probe that zeroes the massive channels causes a sharp collapse in generation quality, while disrupting an equally-sized set of low-statistic channels has marginal effect. Second, they are spatially organized: restricting image-stream tokens to massive channels and clustering them yields coherent partitions that closely align with the main subject and salient regions, exposing a structured spatial code hidden inside an apparently outlier-like subspace. Third, they are transferable: transporting massive activations from one prompt-conditioned trajectory into another shifts the final image toward the source prompt while preserving substantial content from the target, producing localized semantic interpolation rather than unstructured pixel blending. We exploit this property in two use cases: text-conditioned and image-conditioned semantic transport, where massive-activation transport enables prompt interpolation and subject-driven generation without any additional training. Together, these results recast massive activations not as activation anomalies, but as a sparse prompt-conditioned carrier subspace that organizes and controls semantic information in modern DiT models.

[CV-157] Vision-Based Runtime Monitoring under Varying Specifications using Semantic Latent Representations

链接: https://arxiv.org/abs/2605.13923
作者: Bardh Hoxha,Oliver Schön,Hideki Okamoto,Lars Lindemann,Georgios Fainekos
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:We study certified runtime monitoring of past-time signal temporal logic (ptSTL) from visual observations under partial observability. The monitor must infer safety-relevant quantities from images and provide finite-sample guarantees, while being *reusable*: once trained and calibrated, it should certify any formula in a target fragment without per-formula retraining. For fragments induced by a finite dictionary of temporal atoms, we prove that the *semantic basis*, the vector of atom robustness scores, is the minimum prediction target within the class of monotone, 1-Lipschitz reusable interfaces: any formula is evaluated by a deterministic decoder derived from the parse tree, and a single conformal calibration pass certifies the entire fragment with no union bound. We also introduce a *rolling prediction monitor* that predicts only current predicate values and reconstructs temporal history online; this is easier to learn but grows conservative at long horizons. On a pedestrian-crossroad benchmark, rolling achieves tighter certified bounds at short horizons while the semantic-basis monitor is up to 4-times tighter at long horizons. We validate the presented monitors on real-world Waymo driving data, where both monitors satisfy the conformal coverage guarantee empirically.
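The single conformal calibration pass mentioned above follows the standard split-conformal recipe, which can be sketched in a few lines (illustrative half-normal scores, not the paper's image-based monitors):

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile: with probability >= 1 - alpha, a fresh
    exchangeable score falls below the returned threshold (finite-sample)."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the conformal quantile
    return np.sort(cal_scores)[min(k, n) - 1]

rng = np.random.default_rng(1)
cal = np.abs(rng.normal(size=500))        # |prediction error| on calibration set
tau = conformal_threshold(cal, alpha=0.1)

test = np.abs(rng.normal(size=10000))     # fresh scores from the same distribution
coverage = (test <= tau).mean()
print(f"threshold={tau:.3f}, empirical coverage={coverage:.3f}")
```

Empirical coverage lands near the nominal 1 − α = 0.9; the paper's contribution is that one such pass over the semantic basis certifies every formula in the fragment at once.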

[CV-158] Elastic Spiking Transformers for Efficient Gesture Understanding

链接: https://arxiv.org/abs/2605.13869
作者: Alberto Ancilotto,Gianluca Amprimo,Stefano Di Carlo,Elisabetta Farella
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs), particularly Spiking Transformers, offer energy-efficient processing of event-based sensor data for healthcare applications. Yet current architectures are rigid: they are trained and deployed as static networks with fixed parameter counts and computational graphs. This limits deployment on neuromorphic hardware such as Loihi and SpiNNaker, where on-chip constraints often require smaller models that trade accuracy for feasibility. We introduce the Elastic Spiking Transformer, a runtime-adaptive architecture that brings elasticity into the spiking paradigm. Inspired by Matryoshka-style representation learning, it embeds nested elasticity in the Feature Extractor, Spiking Self-Attention, and Feed-Forward blocks. Through granularity-aware weight sharing, a single universal model can dynamically slice network width and attention heads at inference time without retraining. This design provides two key advantages for SNNs. First, it allows the model to adjust its parameter footprint to different hardware memory budgets. Second, reducing active neurons also lowers spike firing rates, yielding proportional reductions in synaptic operations, an energy benefit not directly available in standard artificial neural networks. We evaluate the approach on CIFAR10/100, CIFAR10-DVS, and the EHWGesture clinical gesture understanding dataset. Results show that one Elastic Spiking Transformer spans a broad range of complexity-accuracy trade-offs, matching or surpassing independently trained baselines while supporting adaptive, real-time gesture recognition on resource-constrained edge devices.
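The nested, Matryoshka-style weight sharing described above can be sketched as follows: a reduced-width forward pass reuses a prefix slice of the full weight matrix, so one set of parameters serves every width. This is a toy numpy illustration of the slicing idea, not the paper's spiking architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_full = 32, 128
W = rng.normal(size=(d_full, d_in)) / np.sqrt(d_in)  # full-width linear layer
b = np.zeros(d_full)

def sliced_forward(x, width):
    """Run the same layer at reduced width by taking the first `width`
    output units -- nested weight sharing, no retraining at inference."""
    return x @ W[:width].T + b[:width]

x = rng.normal(size=(4, d_in))
full = sliced_forward(x, d_full)
half = sliced_forward(x, d_full // 2)

# the half-width output is exactly a prefix of the full-width output
assert np.allclose(half, full[:, : d_full // 2])
print(half.shape)  # (4, 64)
```

In the spiking setting, shrinking the active width also removes those neurons' spikes entirely, which is where the proportional reduction in synaptic operations comes from.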

[CV-159] Seed3D 2.0: Advancing High-Fidelity Simulation-Ready 3D Content Generation

链接: https://arxiv.org/abs/2605.13862
作者: Diandian Gu,Jing Lin,Gaohong Liu,Jiahang Liu,Su Ma,Guang Shi,Jun Wang,Qinlong Wang,Qianyi Wu,Zhongcong Xu,Xuanyu Yi,Zihao Yu,Jianfeng Zhang,Zhuolin Zheng,Yifan Zhu,Rui Chen,Hengkai Guo,Xiaoyang Guo,Mingcong Han,Xu Han,Xiu Li,Yixun Liang,Weiqiang Lou,Junzhe Lu,Guan Luo,Minghan Qin,Shuguang Wang,Yuang Wang
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Seed3D 2.0 Technical Report; Official Page on this https URL

点击查看摘要

Abstract:We present Seed3D 2.0, an advanced 3D content generation system built on Seed3D 1.0, with substantial improvements across generation fidelity, simulation-ready capabilities, and application coverage. For geometry, a coarse-to-fine two-stage pipeline decouples global structure learning from high-frequency detail recovery, while a locality-aware VAE achieves higher spatial compression and more efficient decoding. For texture and material generation, we replace the cascaded pipeline of Seed3D 1.0 with a unified PBR model that directly generates multi-view albedo and metallic-roughness maps, enhanced by Mixture-of-Experts scaling and VLM-based semantic conditioning for improved material precision and visual fidelity. Beyond single-object generation, Seed3D 2.0 introduces a simulation-ready model suite comprising scene layout planning, part-aware decomposition, and training-free articulation generation, enabling coherent scene construction and part-level physical interaction across physics and graphics engines. A large-scale human preference study against five recent commercial models shows that Seed3D 2.0 achieves consistent win rates of 69.0% to 89.9% in textured 3D asset generation. Seed3D 2.0 is available on this https URL

[CV-160] MoZoo: Unleashing Video Diffusion power in animal fur and muscle simulation

链接: https://arxiv.org/abs/2605.13857
作者: Dongxia Liu,Jie Ma,Xiaochen Yang,Jiancheng Zhang,Bin Xia,Zhehan Kan,Nisha Huang,Jun Liang,Wenming Yang,Jin Li
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Github Page: this https URL

点击查看摘要

Abstract:The creation of cinematic-quality animal effects necessitates the precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive within traditional production workflows. While generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexploited. We present MoZoo, a generative dynamics solver that bypasses conventional refinement to synthesize high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE) which employs role-based index remapping to synchronize motion alignment while decoupling reference information via fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, effectively preventing feature interference and improving computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that leverages a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark with 120 mesh-video pairs. Experimental results demonstrate that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts, preserving superior temporal and structural consistency.

[CV-161] SparseOIT: Improving Order-Independent Transparency 3DGS via Active Set Method

链接: https://arxiv.org/abs/2605.13855
作者: Wentao Yang,Fanzhen Kong,Zejian Kang,Xiangru Huang
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has received tremendous popularity over the past few years due to its photorealistic visual appearance. However, 3DGS uses volumetric rendering that is not suitable for objects with non-lambertian or transparent materials. To remedy this issue, a family of Order-Independent Transparency (OIT) rendering methods propose to remove or modify the depth sorting step in the 3DGS rendering equation. However, the potential of OIT-based method is still underexplored. In this paper, we observe that the OIT modifications to the rendering equation significantly reduce the inter-independence among individual gaussian splats, resulting in very sparse variable dependencies that can be harnessed by specific optimization techniques such as active set method. To this end, we propose SparseOIT, an OIT-based 3DGS reconstruction algorithm that maintains an active set of gaussian splats and enjoys an acceleration ratio that is proportional to the potential sparsity. SparseOIT is designed by jointly considering the OIT rendering equation, the reconstruction algorithm and the geometric regularization. Through extensive experiments, we demonstrate that SparseOIT outperforms existing methods in the OIT-family by a large margin and also achieves comparable performance to the state-of-the-art 3DGS reconstruction methods based on volumetric rendering. Project page:

[CV-162] Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery ICME2026

链接: https://arxiv.org/abs/2605.13854
作者: Minghao Sun,Chongyang Xu,Yitao Xie,Buzhen Huang,Kun Li
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: ICME 2026

点击查看摘要

Abstract:Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at this https URL.

[CV-163] FaceParts: Segmentation and Editing of Gaussian Splatting

链接: https://arxiv.org/abs/2605.13853
作者: Tymoteusz Zapała,Julia Farganus,Dominik Galus,Mikołaj Czachorowski,Piotr Syga,Przemysław Spurek
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Facial editing is an important task with applications in entertainment, virtual reality, and digital avatars. Most existing approaches rely on generative models in the 2D image domain, while in 3D the task is typically performed through labor-intensive manual editing. We propose FaceParts, a framework for unsupervised segmentation and editing of Gaussian Splatting avatars. Unlike existing 2D or mesh-assisted methods, our approach operates directly in the Gaussian domain, decomposing avatars into semantically coherent facial parts without supervision. The method integrates feature disentanglement, density-based clustering, and FLAME-anchored part transfer, enabling precise editing and cross-avatar part swapping. Experiments on the NeRSemble dataset with 11 subjects demonstrate robust isolation of features such as beards, eyebrows, eyes and mustaches. Quantitative evaluation confirms that transferred segments adapt to pose and expression, while maintaining identity consistency (ID = 0.943), low Average Expression Distance (AED = 0.021) and low Average Pose Distance (APD = 0.004).

[CV-164] Realiz3D: 3D Generation Made Photorealistic via Domain-Aware Learning CVPR2026

链接: https://arxiv.org/abs/2605.13852
作者: Ido Sobol,Kihyuk Sohn,Yoav Blum,Egor Zakharov,Max Bluvstein,Andrea Vedaldi,Or Litany
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026. Project page: this https URL

点击查看摘要

Abstract:We often aim to generate images that are both photorealistic and 3D-consistent, adhering to precise geometry, material, and viewpoint controls. Typically, this is achieved by fine-tuning an image generator, pre-trained on billions of real images, using renders of synthetic 3D assets, where annotations for control signals are available. While this approach can learn the desired controls, it often compromises the realism of the images due to domain gap between photographs and renders. We observe that this issue largely arises from the model learning an unintended association between the presence of control signals and the synthetic appearance of the images. To address this, we introduce Realiz3D, a lightweight framework for training diffusion models, that decouples controls and visual domain. The key idea is to explicitly learn visual domain, real or synthetic, separately from other control signals by introducing a co-variate that, fed into small residual adapters, shifts the domain. Then, the generator can be trained to gain controllability, without fitting to specific visual domain. In this way, the model can be guided to produce realistic images even when controls are applied. We enhance control transferability to the real domain by leveraging insights about roles of different layers and denoising steps in diffusion-based generators, informing new training and inference strategies that further mitigate the gap. We demonstrate the advantages of Realiz3D in tasks such as text-to-multiview generation and texturing from 3D inputs, producing outputs that are 3D-consistent and photorealistic.

[CV-165] Efficient Dense Matching for Enhanced Gaussian Splatting Using AV1 Motion Vectors

链接: https://arxiv.org/abs/2605.14629
作者: Julien Zouein,Vibhoothi Vibhoothi,François Pitié,Anil Kokaram
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has emerged as a prominent framework for real-time, photorealistic scene reconstruction, offering significant speed-ups over Neural Radiance Fields (NeRF). However, the fidelity of 3DGS representations remains heavily dependent on the quality of the initial point cloud. While standard Structure-from-Motion (SfM) pipelines using COLMAP provide adequate initialisation, they often suffer from high computational costs and sparsity in textureless regions, which degrades subsequent reconstruction accuracy and convergence speed. In this work, we introduce an AV1-based feature detection and matching pipeline that significantly reduces SfM processing overhead. By leveraging motion vectors inherent to the AV1 video codec, we bypass computationally expensive exhaustive matching while maintaining geometric robustness. Our pipeline produces substantially denser point clouds, with up to eight times as many points as classical SfM. We demonstrate that this enhanced initialisation directly improves 3DGS performance, yielding an 9-point increase in VMAF and a 63% average reduction in training time required to reach baseline quality. The project page: this https URL

[CV-166] Keyed Nonlinear Transform: Lightweight Privacy-Enhancing Feature Sharing for Medical Image Analysis

链接: https://arxiv.org/abs/2605.14123
作者: Haebom Lee,Gyeongjung Kim
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Feature sharing via split inference offers a lightweight alternative to federated learning for resource-constrained hospitals, but transmitted features still leak patient identity information and lack practical mechanisms for controlled feature sharing. We propose Keyed Nonlinear Transform (KNT), a drop-in feature transformation that applies key-conditioned obfuscation to intermediate representations. KNT reduces re-identification AUC from 0.635 to 0.586, corresponding to a 36% reduction in above-chance identity signal, while introducing only 0.15 ms CPU overhead, without backbone retraining, and preserving classification performance within 1.0 pp. Our analysis shows that KNT’s nonlinear transform prevents closed-form inversion and shifts recovery to iterative gradient-based optimization under full key compromise, substantially increasing inversion difficulty. The same transform generalizes to dense prediction tasks, incurring only a 4.4 pp Dice reduction on skin-lesion segmentation without retraining. These results position KNT as a practical and efficient privacy layer for split inference deployments.
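As an illustration of key-conditioned obfuscation of intermediate features: the abstract does not specify KNT's concrete transform, so the keyed-permutation-plus-tanh design below is purely hypothetical, showing only the general pattern (deterministic under the key, nonlinear so there is no closed-form inverse):

```python
import numpy as np

def keyed_transform(features, key):
    """Hypothetical key-conditioned obfuscation of a (C, H, W) feature map:
    a keyed channel permutation followed by a keyed elementwise nonlinearity.
    Not KNT's actual transform -- an illustration of the interface only."""
    rng = np.random.default_rng(key)
    c = features.shape[0]
    perm = rng.permutation(c)                    # secret channel shuffle
    scale = rng.uniform(0.5, 2.0, size=(c, 1, 1))
    return np.tanh(scale * features[perm])       # nonlinear: no closed-form inverse

feat = np.random.default_rng(3).normal(size=(8, 4, 4))
out_a = keyed_transform(feat, key=42)
out_b = keyed_transform(feat, key=43)

# different keys yield different obfuscated features; the same key is deterministic,
# so a downstream model trained under one key sees a consistent representation
assert not np.allclose(out_a, out_b)
assert np.allclose(out_a, keyed_transform(feat, key=42))
```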

[CV-167] Covariance-aware sampling for Diffusion Models

链接: https://arxiv.org/abs/2605.13910
作者: Andrea Schioppa,Tim Salimans
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We present a covariance-aware sampler that improves the quality of pixel-space Diffusion Model (DM) sampling in the few-step regime. We hypothesize that in the few-step regime samplers fail because they rely solely on the predicted mean of the reverse distribution, while our solution explicitly models the reverse-process covariance. Our method combines Tweedie’s formula to estimate the covariance with an efficient, structured Fourier-space decomposition of the covariance matrix. Implemented as an extension of DDIM, our method requires only a minimal overhead: one extra Jacobian-Vector Product (JVP) per step. We demonstrate that for pixel-based DMs, our method consistently produces superior samples compared to state-of-the-art second order samplers (Heun, DPM-Solver++) and the recent aDDIM sampler, at an identical number of function evaluations (NFE).

[CV-168] Physics-Grounded Adversarial Stain Augmentation with Calibrated Coverage Guarantees

链接: https://arxiv.org/abs/2605.13889
作者: Mingi Hong
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Stain variation across hospitals degrades histopathology models at deployment. Existing augmentation methods perturb color spaces with arbitrary hyperparameters, lacking both a principled budget and coverage guarantees for unseen centers. We propose **Calibrated Adversarial Stain Augmentation (CASA)**, which performs adversarial augmentation in the Macenko stain parameter space with a budget calibrated from multi-center statistics via the DKW inequality. On Camelyon17-WILDS (5 seeds), CASA achieves 93.9% ± 1.6% slide-level accuracy – outperforming HED-strong (88.4% ± 7.3%), RandStainNA (85.2% ± 6.7%), and ERM (63.9% ± 11.3%) – with the highest worst-group accuracy (84.9% ± 0.9%) among all 10 compared methods.
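The DKW inequality used to calibrate the budget gives a distribution-free, finite-sample bound on how far an empirical CDF can deviate from the true one. A sketch of how such a bound could conservatively widen empirical stain-parameter quantiles (the data and the exact budget construction are illustrative, not necessarily CASA's):

```python
import numpy as np

def dkw_epsilon(n, alpha=0.05):
    """DKW inequality: with probability >= 1 - alpha, the empirical CDF of
    n i.i.d. samples deviates from the true CDF by at most eps, uniformly."""
    return np.sqrt(np.log(2.0 / alpha) / (2.0 * n))

# hypothetical per-center stain-parameter samples (e.g. Macenko concentrations)
rng = np.random.default_rng(4)
params = rng.normal(loc=1.0, scale=0.2, size=400)

eps = dkw_epsilon(len(params))
# widen the nominal 5%-95% empirical range outward by eps to be conservative
lo = np.quantile(params, max(0.05 - eps, 0.0))
hi = np.quantile(params, min(0.95 + eps, 1.0))
print(f"epsilon={eps:.4f}, calibrated budget=[{lo:.3f}, {hi:.3f}]")
```

The appeal of the DKW route is that the augmentation budget shrinks at a known 1/√n rate as more multi-center samples become available, instead of being a hand-tuned hyperparameter.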

人工智能

[AI-0] Eradicating Negative Transfer in Multi-Physics Foundation Models via Sparse Mixture-of-Experts Routing

链接: https://arxiv.org/abs/2605.15179
作者: Ellwil Sharma,Arastu Sharma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注: 5 pages, 4 figures

点击查看摘要

Abstract:Scaling Scientific Machine Learning (SciML) toward universal foundation models is bottlenecked by negative transfer: the simultaneous co-training of disparate partial differential equation (PDE) regimes can induce gradient conflict, unstable optimization, and plasticity loss in dense neural operators. In particular, broadband open-channel fluid dynamics and boundary-dominated porous media flows impose incompatible spectral and geometric demands on a single dense parameter path. We introduce Shodh-MoE, a sparse-activated latent transformer architecture for multi-physics transport. Shodh-MoE operates on compressed 16^3 physical latents produced by a physics-informed autoencoder with an intra-tokenizer Helmholtz-style velocity parameterization, restricting decoded states to divergence-free velocity manifolds. The model guarantees exact mass conservation, achieving a physically verifiable velocity divergence of ~2.8 x 10^-10 (evaluated post-hoc in FP64) on 128^3 grids. A Top-1 soft-semantic router dynamically assigns localized latent patches to expert subnetworks, enabling specialized parameter paths for distinct physical mechanisms while preserving shared experts for universal symmetries. In a 20,000-step distributed pretraining run over mixed three-dimensional physical tensors, routing telemetry shows autonomous domain bifurcation: held-out validation tokens from the open-channel domain route exclusively to Expert 0, while porous-media tokens route exclusively to Expert 1. The model converges simultaneously across both regimes, achieving latent validation MSEs of 2.46 x 10^-5 and 9.76 x 10^-6, and decoded physical MSEs of 2.48 x 10^-6 and 1.76 x 10^-6. These results support sparse expert routing as a practical architectural mechanism for mitigating multi-physics interference in universal neural operators.
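The Top-1 routing mechanism described above can be sketched as follows: a linear router scores each token, the arg-max expert processes it, and the expert's output is weighted by the router's soft confidence. A toy numpy version (not Shodh-MoE's actual soft-semantic router or its latent tokenizer):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

rng = np.random.default_rng(5)
n_tokens, d, n_experts = 16, 8, 2
tokens = rng.normal(size=(n_tokens, d))
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy expert weights

# Top-1 routing: each token is processed by exactly one expert,
# scaled by the router's (soft) confidence so gradients reach the router
probs = softmax(tokens @ W_router)
choice = probs.argmax(-1)
out = np.stack([probs[i, choice[i]] * (tokens[i] @ experts[choice[i]])
                for i in range(n_tokens)])
print(out.shape, np.bincount(choice, minlength=n_experts))
```

The domain bifurcation reported in the abstract corresponds to `choice` becoming perfectly correlated with the physical regime of each token, so each expert's parameters specialize without any explicit domain label.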

[AI-1] OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation

链接: https://arxiv.org/abs/2605.15177
作者: Shang Zhou,Wenhao Chai,Kaiyuan Liu,Huanzhi Mao,Qiuyang Mang,Jingbo Shang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures

点击查看摘要

Abstract:Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro’s effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.
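Bradley-Terry aggregation turns noisy pairwise judgments into a global ranking. A minimal sketch using the classic minorize-maximize (MM) updates on a toy win matrix (illustrative only, not OpenDeepThink's full pipeline with critiques and mutation):

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """MM algorithm for Bradley-Terry strengths from a pairwise win matrix.
    wins[i, j] = number of times candidate i beat candidate j."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T                 # total comparisons per pair
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(1)
        p = wins.sum(1) / denom
        p /= p.sum()                      # normalize for identifiability
    return p

# toy judge votes among 3 candidates: A usually beats B and C, B usually beats C
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
p = bradley_terry(wins)
print(np.round(p, 3))
assert p[0] > p[1] > p[2]   # global ranking recovered from noisy pairwise votes
```

Because each generation only needs random pairs rather than all-pairs comparisons, the number of judge calls can stay well below the quadratic worst case while the aggregated ranking remains stable.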

[AI-2] Position: Behavioural Assurance Cannot Verify the Safety Claims Governance Now Demands

链接: https://arxiv.org/abs/2605.15164
作者: Pratinav Seth,Vinay Kumar Sankarapu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This position paper argues that behavioural assurance, even when carefully designed, is being asked to carry safety claims it cannot verify. AI governance frameworks enacted between 2019 and early 2026 require reviewable evidence of properties such as the absence of hidden objectives, resistance to loss-of-control precursors, and bounded catastrophic capability; current assurance methodologies (primarily behavioural evaluations and red-teaming) are epistemically limited to observable model outputs and cannot verify the latent representations or long-horizon agentic behaviours these frameworks presume to regulate. We formalize this structural mismatch as the audit gap, the divergence between required and achievable verification access, and introduce the concept of fragile assurance to describe cases where the evidential structure does not support the asserted safety claim. Through an analysis of a 21-instrument inventory, we identify an incentive gradient where geopolitical and industrial pressures systematically reward surface-level behavioral proxies over deep structural verification. Finally, we propose a technical pivot: bounding the weight of behavioral evidence in legal text and extending voluntary pre-deployment access with mechanistic-evidence classes, specifically linear probes, activation patching, and before/after-training comparisons.

[AI-3] Pelican-Unified 1.0: A Unified Embodied Intelligence Model for Understanding Reasoning Imagination and Action

链接: https://arxiv.org/abs/2605.15153
作者: Yi Zhang,Yinda Chen,Che Liu,Zeyuan Ding,Jin Xu,Shilong Zou,Junwei Liao,Jiayu Hu,Xiancong Ren,Xiaopeng Zhang,Yechi Liu,Haoyuan Shi,Zecong Tang,Haosong Sun,Renwen Cui,Kuishu Wu,Wenhai Liu,Yang Xu,Yingji Zhang,Yidong Wang,Senkang Hu,Jinpeng Lu,Nga Teng Chan,Yechen Wu,Yong Dai,Jian Tang,Xiaozhu Ju
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present Pelican-Unified 1.0, the first embodied foundation model trained according to the principle of unification. Pelican-Unified 1.0 uses a single VLM as a unified understanding module, mapping scenes, instructions, visual contexts, and action histories into a shared semantic space. The same VLM also serves as a unified reasoning module, autoregressively producing task-, action-, and future-oriented chains of thought in a single forward pass and projecting the final hidden state into a dense latent variable. A Unified Future Generator (UFG) then conditions on this latent variable and jointly generates future videos and future actions through two modality-specific output heads within the same denoising process. The language, video, and action losses are all backpropagated into the shared representation, enabling the model to jointly optimize understanding, reasoning, imagination, and action during training, rather than training three isolated expert systems. Experiments demonstrate that unification does not imply compromise. With a single checkpoint, Pelican-Unified 1.0 achieves strong performance across all three capabilities: 64.7 on eight VLM benchmarks, the best among comparable-scale models; 66.03 on WorldArena, ranking first; and 93.5 on RoboTwin, the second-best average among compared action methods. These results show that the unified paradigm succeeds in preserving specialist strength while bringing understanding, reasoning, imagination, and action into one model.

[AI-4] Widening the Gap: Exploiting LLM Quantization via Outlier Injection

链接: https://arxiv.org/abs/2605.15152
作者: Xiaohua Zhan,Kazuki Egashira,Robin Staab,Mark Vero,Martin Vechev
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM quantization has become essential for memory-efficient deployment. Recent work has shown that quantization schemes can pose critical security risks: an adversary may release a model that appears benign in full precision but exhibits malicious behavior once quantized by users. However, existing quantization-conditioned attacks have been limited to relatively simple quantization methods, where the attacker can estimate weight regions that remain invariant under the target quantization. Notably, prior attacks have consistently failed to compromise more popular and sophisticated schemes, limiting their practical impact. In this work, we introduce the first quantization-conditioned attack that consistently induces malicious behavior that can be triggered by a broad range of advanced quantization techniques, including AWQ, GPTQ, and GGUF I-quants. Our attack exploits a simple property shared by many modern quantization methods: large outliers can cause other weights to be rounded to zero. Consequently, by injecting outliers into specific weight blocks, an adversary can induce a targeted, predictable weight collapse in the model. This effect can be used to craft seemingly benign full-precision models that exhibit a wide range of malicious behaviors after quantization. Through extensive evaluation across three attack scenarios and LLMs, we show that our attack achieves high success rates against a broad range of quantization methods on which prior attacks fail. Our results demonstrate, for the first time, that the security risks of quantization are not restricted to simpler schemes but are broadly relevant across complex, widely-used quantization methods.
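摘要所依赖的性质——在按块做 absmax 对称量化时,单个大离群值会拉大缩放因子,使同块内其他权重全部舍入为零——可以用几行代码直接验证(示意代码,与论文的攻击实现无关,数值均为虚构):

```python
# 示意:absmax 分块对称量化中,一个注入的离群值即可让同块其他权重
# 在反量化后精确塌缩为 0。非论文攻击实现,权重数值为虚构。

def quantize_block(block, bits=4):
    """对单个权重块做对称 absmax 量化并反量化。"""
    qmax = 2 ** (bits - 1) - 1              # int4 时为 7
    scale = max(abs(w) for w in block) / qmax
    q = [round(w / scale) for w in block]   # 量化为整数格点
    return [v * scale for v in q]           # 反量化后的权重

benign = [0.5, -0.3, 0.4, 0.2]
attacked = benign + [100.0]                 # 注入的离群值

deq = quantize_block(attacked)
# 反量化后,所有正常权重精确变为 0.0,只有离群值本身得以保留。
```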

[AI-5] Dual-Dimensional Consistency: Balancing Budget and Quality in Adaptive Inference-Time Scaling

链接: https://arxiv.org/abs/2605.15100
作者: Rongman Xu,Yifei Li,Tianzhe Zhao,Yanrui Wu,Bo Li,Hang Yan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable abilities in reasoning. However, maximizing their potential through inference-time scaling faces a challenging trade-off between sampling budget and reasoning quality. Current strategies remain inefficient as they typically treat sampling width and depth as orthogonal objectives, where width consensus methods risk reinforcing hallucinations, while depth pruning mechanisms prematurely truncate complex yet valid reasoning chains. Therefore, we propose Dual-Dimensional Consistency (DDC), a unified framework that bridges path quality with adaptive termination. By coupling a Confidence-Weighted Bayesian protocol with Trend-Aware Stratified Pruning, our method ensures that computational resources are concentrated on high-quality reasoning paths, filtering hallucinations while accelerating consensus. Evaluations across five benchmarks demonstrate that this approach reduces token consumption more than tenfold while maintaining or exceeding the accuracy of strong baselines across various LLMs.

[AI-6] Novel Dynamic Batch-Sensitive Adam Optimiser for Vehicular Accident Injury Severity Prediction

链接: https://arxiv.org/abs/2605.15083
作者: Daniel Asare Kyei,Alimatu Saadia-Yussiff,Maame G. Asante-Mensah,Abdul Lateef-Yussiff,Charles Roland Haruna,Derry Emmanuel
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The choice of optimiser is important in deep learning, as it strongly influences model efficiency and speed of convergence. However, many commonly used optimisers encounter difficulties when applied to imbalanced and sequential datasets, limiting their ability to capture patterns of minority classes. In this study, we propose Dynamic Batch-Sensitive Adam (DBS-Adam), an optimiser that dynamically scales the learning rate using a batch difficulty score derived from exponential moving averages of gradient norms and batch loss. DBS-Adam improves training stability and accelerates convergence by increasing updates for difficult batches and reducing them for easier ones. We evaluate DBS-Adam by integrating it with Bi-Directional LSTM networks for accident injury severity prediction, addressing class imbalance through SMOTE-ENN resampling and Focal Loss. Four experimental configurations compare baseline Bi-LSTM models and alternative architectures to assess optimiser impact. Rigorous comparison against state-of-the-art optimisers (AMSGrad, AdamW, AdaBound) across five random seeds demonstrated DBS-Adam’s competitive performance with statistically significant precision improvements (p=0.020). Results indicate that DBS-Adam outperforms standard optimisation approaches, achieving 95.22% test accuracy, 96.11% precision, 95.28% recall, 95.39% F1-score, and a test loss of 0.0086. The proposed framework enables effective real-time accident severity classification for targeted emergency response and road safety interventions, demonstrating the value of DBS-Adam for learning from imbalanced sequential data.
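按摘要描述,DBS-Adam 用梯度范数与批损失的指数滑动平均(EMA)构造"批难度分数"来动态缩放学习率。下面是一个一维玩具版示意:打分规则、截断范围与所有超参数均为笔者假设,并非论文公式:

```python
# 一维玩具版示意:用梯度范数与批损失的 EMA 构造难度分数缩放 Adam 步长。
# 打分规则与超参数均为假设,非论文原始算法。
import math

class ToyDBSAdam:
    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, ema=0.9, eps=1e-8):
        self.lr, self.b1, self.b2, self.ema, self.eps = lr, beta1, beta2, ema, eps
        self.m = self.v = 0.0
        self.gnorm_ema = self.loss_ema = None
        self.t = 0

    def difficulty(self, grad, loss):
        g = abs(grad)
        if self.gnorm_ema is None:          # 首个 batch:用当前值初始化 EMA
            self.gnorm_ema, self.loss_ema = g, loss
        self.gnorm_ema = self.ema * self.gnorm_ema + (1 - self.ema) * g
        self.loss_ema = self.ema * self.loss_ema + (1 - self.ema) * loss
        # 梯度范数与损失高于各自 EMA 的 batch 视为"难",获得更大步长
        score = 0.5 * (g / (self.gnorm_ema + self.eps)
                       + loss / (self.loss_ema + self.eps))
        return max(0.1, min(2.0, score))    # 截断以保证更新有界(示意假设)

    def step(self, param, grad, loss):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * grad
        self.v = self.b2 * self.v + (1 - self.b2) * grad * grad
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        scaled_lr = self.lr * self.difficulty(grad, loss)
        return param - scaled_lr * m_hat / (math.sqrt(v_hat) + self.eps)

# 在 f(x) = x^2 上最小化;batch 损失即 f(x) 本身。
x = 3.0
opt = ToyDBSAdam()
for _ in range(200):
    x = opt.step(x, grad=2 * x, loss=x * x)
```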

[AI-7] NeuroTrain: Surveying Local Learning Rules for Spiking Neural Networks with an Open Benchmarking Framework

链接: https://arxiv.org/abs/2605.15058
作者: Alessio Caviglia,Filippo Marostica,Roberta Bardini,Alessandro Savino,Stefano Di Carlo
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rapid expansion of spiking neural networks (SNNs) has led to a proliferation of training algorithms that differ widely in biological inspiration, computational structure, and hardware suitability. Despite this progress, the field lacks a unified, fine-grained taxonomy that systematically organizes these approaches and clarifies their conceptual relationships. This survey provides a comprehensive taxonomy of SNN training algorithms, spanning surrogate-gradient backpropagation, local and three-factor learning rules, biologically inspired plasticity mechanisms, ANN-to-SNN conversion pipelines, and non-standard optimization strategies. We analyze each class in terms of its computational principles, learning signals, and locality properties. To support reproducible research, we release NeuroTrain, an open-source snnTorch-based framework that implements a representative set of these algorithms within a unified, modular, and extendable framework, enabling consistent benchmarking across datasets, architectures, and training regimes. By consolidating fragmented literature and providing a reusable benchmarking framework, this survey identifies common patterns, highlights open challenges, and outlines promising directions for future work on scalable, efficient SNN training.

[AI-8] TFGN: Task-Free Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

链接: https://arxiv.org/abs/2605.15053
作者: Anurup Ganguli
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 65 pages, 10 figures, 40 tables

点击查看摘要

Abstract:Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and =99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source-target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner. 
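摘要中 Read/Write 分解的几何含义——"不向旧领域子空间写入"——即把新领域的更新投影到旧领域方向的正交补上。下面是一个纯示意的投影实现(与 TFGN 的真实机制无关,仅说明该几何操作;向量数值均为虚构):

```python
# 纯几何示意(非 TFGN 实现):把新领域更新投影到旧领域方向的正交补上,
# 使更新不触碰旧领域子空间。假设 basis 中向量已标准正交。

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(update, basis):
    """去掉 update 落在 span(basis) 内的分量。"""
    out = list(update)
    for b in basis:
        c = dot(out, b)
        out = [o - c * x for o, x in zip(out, b)]
    return out

prior = [[1.0, 0.0, 0.0]]            # 旧领域写入过的方向
update = [2.0, 3.0, -1.0]            # 新领域的原始梯度更新
safe = project_out(update, prior)    # 结果与旧方向正交
```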

[AI-9] SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

链接: https://arxiv.org/abs/2605.15044
作者: KiHyun Nam,Jungwoo Heo,Siu Bae,Ha-Jin Yu,Joon Son Chung
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.

[AI-10] WARD: Adversarially Robust Defense of Web Agents Against Prompt Injections

链接: https://arxiv.org/abs/2605.15030
作者: Tri Cao,Yulin Chen,Hieu Cao,Yibo Li,Khoi Le,Thong Nguyen,Yuexin Li,Yufei He,Yue Liu,Shuicheng Yan,Bryan Hooi
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Code and models: this https URL

点击查看摘要

Abstract:Web agents can autonomously complete online tasks by interacting with websites, but their exposure to open web environments makes them vulnerable to prompt injection attacks embedded in HTML content or visual interfaces. Existing guard models still suffer from limited generalization to unseen domains and attack patterns, high false positive rates on benign content, reduced deployment efficiency due to added latency at each step, and vulnerability to adversarial attacks that evolve over time or directly target the guard itself. To address these limitations, we propose WARD (Web Agent Robust Defense against Prompt Injection), a practical guard model for secure and efficient web agents. WARD is built on WARD-Base, a large-scale dataset with around 177K samples collected from 719 high-traffic URLs and platforms, and WARD-PIG, a dedicated dataset designed for prompt injection attacks targeting the guard model. We further introduce A3T, an adaptive adversarial attack training framework that iteratively strengthens WARD through a memory-based attacker and guard co-evolution process. Extensive experiments show that WARD achieves nearly perfect recall on out-of-distribution benchmarks, maintains low false positive rates to preserve agent utility, remains robust against guard-targeted and adaptive attacks under substantial distribution shifts, and runs efficiently in parallel with the agent without introducing additional latency.

[AI-11] SemaTune: Semantic-Aware Online OS Tuning with Large Language Models

链接: https://arxiv.org/abs/2605.15026
作者: Georgios Liargkovas,Mihir Nitin Joshi,Hubertus Franke,Kostis Kaffes
机构: 未知
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 17 pages, 12 figures

点击查看摘要

Abstract:Online OS tuning can improve long-running services, but existing controllers are poorly matched to live hosts. They treat scheduler, power, memory, and I/O controls as black-box variables and optimize a scalar reward. This view ignores cross-knob policy structure, breaks down when application metrics are unavailable, and can send a running service into degraded regions that persist after the bad setting is removed. We present SemaTune, a host-side framework for steady-state OS tuning with bounded language-model guidance. SemaTune turns knob schemas, telemetry, current configuration, recent action–response history, and retrieved prior runs into a compact decision context. A fast loop proposes low-latency updates, a slower loop periodically revises the search strategy, and every proposed change passes through typed validation before reaching kernel or sysctl interfaces. This lets the controller reason about OS-control meaning and indirect performance signals while keeping model cost, latency, and authority constrained. We evaluate SemaTune on 13 live workloads from five benchmark suites while tuning up to 41 Linux parameters. Across the suite, SemaTune improves stable-phase performance by 72.5% over default settings and by 153.3% relative to the strongest non-LLM baseline. A 30-window session costs about $0.20 in model calls. With only host-level metrics, SemaTune still outperforms baselines given direct application objectives by 93.7 percentage points, while avoiding severe degraded regions reached by structure-blind exploration.

[AI-12] Generalized Priority-Aware Shapley Value

链接: https://arxiv.org/abs/2605.15018
作者: Kiljae Lee,Ziqi Liu,Weijing Tang,Yuan Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Shapley value and its priority-aware extensions are widely used for valuation in machine learning, but existing methods require pairwise priority to be binary and acyclic, a restriction spectacularly violated in real-data examples such as aggregated human preferences and multi-criterion comparisons. We introduce the generalized priority-aware Shapley value (GPASV), a random order value defined on arbitrary directed weighted priority graphs, in which pairwise edges penalize rather than forbid order violations. GPASV covers a range of classical models as boundary cases. We establish GPASV through an axiomatic characterization, develop the associated computational methods, and introduce a priority sweeping diagnostic extending PASV’s. We apply GPASV to LLM ensemble valuation on the cyclic Chatbot Arena preference graph, illustrating that priority-aware valuation is not a one-button operation: different balances of pairwise graph priority versus individual soft priority produce substantively different valuations of the same data.
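GPASV 所推广的经典 Shapley 值本身就是一个"随机次序值":对所有玩家排列的边际贡献取平均。下面用可加博弈做一个小验证(GPASV 在此基础上用有向加权优先图对排列重新加权,此处未实现;博弈与权重均为虚构):

```python
# 背景示意:经典 Shapley 值 = 对所有排列的边际贡献取平均。
# GPASV 对排列按优先图重新加权,此处未实现。
import itertools

def shapley_exact(n, value):
    """枚举全部 n! 个排列精确计算 Shapley 值(仅适用于小 n)。"""
    phi = [0.0] * n
    perms = list(itertools.permutations(range(n)))
    for order in perms:
        coalition = set()
        for player in order:
            before = value(coalition)
            coalition.add(player)
            phi[player] += value(coalition) - before
    return [p / len(perms) for p in phi]

# 可加玩具博弈:联盟的价值等于成员权重之和。
weights = [3.0, 1.0, 2.0]
phi = shapley_exact(3, lambda S: sum(weights[i] for i in S))
# 对可加博弈,Shapley 值恰好还原每个玩家自己的权重。
```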

[AI-13] Learning Developmental Scaffoldings to Guide Self-Organisation

链接: https://arxiv.org/abs/2605.14998
作者: Milton L. Montero,Elias Najarro,Jakob Schauser,Sebastian Risi
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Quantitative Methods (q-bio.QM)
备注: 10 pages, 5 figures. Under review

点击查看摘要

Abstract:From subcellular structures to entire organisms, many natural systems generate complex organisation through self-organisation: local interactions that collectively give rise to global structure without any blueprint of the outcome. Yet a significant portion of the information driving such processes is not produced by self-organisation itself, instead, it is often offloaded to initial conditions of the system. Biological development is a prime example, where maternal pre-patterns encode positional and symmetry-breaking information that scaffolds the self-organising process. From maternal morphogen gradients in early embryogenesis to tissue-level morphogenetic pre-patterns guiding organ formation, this transfer of information to initial conditions, analogous to a memory-compute trade-off in computational systems, is a fundamental part of developmental processes. In this work, we study this offloading phenomenon by introducing a model that jointly learns both the self-organisation rules and the pre-patterns, allowing their interplay to be varied and measured under controlled conditions: a Neural Cellular Automaton (NCA) paired with a learned coordinate-based pattern generator (SIREN), both trained simultaneously to generate a set of patterns. We provide information-theoretic analyses of how information is distributed between pre-patterns and the self-organising process, and show that jointly learning both components yields improvements in robustness, encoding capacity, and symmetry breaking over purely self-organising alternatives. Our analysis further suggests that effective pre-patterns do not simply approximate their targets; rather, they bias the developmental dynamics in ways that facilitate convergence, pointing to a non-trivial relationship between the structure of initial conditions and the dynamics of self-organisation.
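摘要中基于坐标的模式生成器 SIREN,是带正弦激活的 MLP,单层形如 f(x) = sin(ω₀·(Wx+b))。下面是单层前向的示意(权重随机、维度任取,仅演示结构与取值范围,并非论文配置):

```python
# 示意:单层 SIREN 前向——缩放仿射映射后逐元素取正弦。
# 权重随机,维度与 omega0 均为演示假设。
import math, random

def siren_layer(x, W, b, omega0=30.0):
    """f(x) = sin(omega0 * (W x + b)),逐输出单元计算。"""
    return [math.sin(omega0 * (sum(w * xi for w, xi in zip(row, x)) + bi))
            for row, bi in zip(W, b)]

rng = random.Random(0)
W = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
b = [rng.uniform(-1, 1) for _ in range(4)]
out = siren_layer([0.25, -0.5], W, b)   # 2 维坐标 -> 4 维隐特征
```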

[AI-14] Second-Order Actor-Critic Methods for Discounted MDPs via Policy Hessian Decomposition

链接: https://arxiv.org/abs/2605.14982
作者: Sanjeev Manivannan,Shuban V
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures including Appendix with Detailed proofs

点击查看摘要

Abstract:We address the discounted reward setting in reinforcement learning (RL). To mitigate the value approximation challenges in policy gradient methods, actor-critic approaches have been developed and are known to converge to stationary points under suitable assumptions. However, these methods rely on first-order updates. In contrast, second-order optimization provides principled curvature-aware updates that are proven to accelerate convergence, but its application in RL is limited by the computational complexity of Hessian estimation. In this work, we analyze second-order approximations for the actor update that leverage the full curvature information of the objective as much as possible. A stable approximation requires treating the action-value function as locally constant with respect to policy parameters, which does not generally hold in policy gradient methods. We show that this approximation becomes well-justified under a two-timescale actor-critic framework, where the critic evolves on a faster timescale and can be treated as quasi-stationary during actor updates. Building on this insight, we formulate a second-order actor-critic method for the discounted reward setting that leverages Hessian-vector product (HVP) computations, resulting in a computationally efficient and stable second-order update.
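摘要所依赖的 Hessian-向量积(HVP)只需两次梯度计算即可近似,无需 O(d²) 地构造整个 Hessian。下面用一个二次目标验证有限差分形式的 HVP(示意;论文中的 HVP 通常由自动微分的二次反传给出,此处的目标函数为虚构):

```python
# 示意:有限差分 HVP —— (grad(theta + eps*v) - grad(theta)) / eps ≈ H v。
# 目标函数与数值均为演示假设。

def grad(theta):
    # f(theta) = 0.5 * theta^T A theta,其中 A = [[2, 1], [1, 3]]
    a, b = theta
    return [2 * a + 1 * b, 1 * a + 3 * b]

def hvp(grad_fn, theta, v, eps=1e-5):
    """两次梯度计算近似 Hessian-向量积,避免显式构造 Hessian。"""
    shifted = [t + eps * x for t, x in zip(theta, v)]
    g1, g0 = grad_fn(shifted), grad_fn(theta)
    return [(a - b) / eps for a, b in zip(g1, g0)]

theta, v = [1.0, -2.0], [1.0, 0.0]
Hv = hvp(grad, theta, v)   # 精确答案为 A @ v = [2, 1]
```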

[AI-15] GraphFlow: An Architecture for Formally Verifiable Visual Workflows Enabling Reliable Agentic AI Automation

链接: https://arxiv.org/abs/2605.14968
作者: Drewry H. Morris V,Luis Valles,Reza Hosseini Ghomi(MedFlow, Inc.)
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:GraphFlow is a visual workflow system designed to improve the reliability of agentic AI automation in multi-step, mission-critical processes. In these workflows, small errors compound rapidly: under an idealized model of independent steps, a ten-step process with 90% per-step reliability completes successfully only 35% of the time. Existing workflow platforms provide durable execution and observability but offer few semantic correctness guarantees, while agentic systems plan at inference time, making behavior sensitive to prompt variation and difficult to audit. GraphFlow is designed to address this gap by treating workflow diagrams as the executable specification, a single artifact defining data scope, execution semantics, and monitoring. At compile time, a restricted class of diagrams is specified to produce reusable automations whose contracts (preconditions, postconditions, and composition obligations) are intended to be proof-checked before admission to a shared library. At runtime, a durable engine records outcomes in an append-only event log and can enforce contracts at system boundaries, supporting replay, retries, and audit. Swimlanes make trust boundaries explicit, separating verified logic from external systems, human judgment, and AI decisions. A year-long pilot across three clinical sites executed 8,728 cohort-enrolled workflow runs with a 97.08% completion rate under an early prototype without the verified-core subsystem; observed failures were localized primarily to external integrations. The formal semantics and proof-checked admission model described here are specified and under active development. Evaluation of the verified core is reserved for future work.

[AI-16] Not All Symbols Are Equal: Importance-Aware Constellation Design for Semantic Communication

链接: https://arxiv.org/abs/2605.14940
作者: Albert Shaju,Christo Kurisummoottil Thomas,Mayukh Roy Chowdhury
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Submitted to IEEE GLOBECOM 2026. 6 pages, 8 figures

点击查看摘要

Abstract:Semantic communication systems for goal-oriented transmission must protect task-relevant information not only through source compression but also via physical layer mapping. Existing approaches decouple constellation design and semantic encoding, exposing critical symbols to channel errors at the same rate as irrelevant ones. Contrary to this, in this paper, a joint semantic-physical layer framework is proposed, which is composed of a vector quantized-variational autoencoder that extracts discrete latent concepts, a semantic criticality indicator (SCI) that scores each concept by task relevance, and a deep reinforcement learning agent that dynamically selects the transmission subset based on instantaneous channel conditions. At the physical layer, a learned semantic-aware M -QAM constellation assigns symbol positions according to joint co-occurrence statistics and SCI scores, departing from the uniform spacing and Gray coding of standard M -QAM which minimizes average BER without regard for semantic content. We introduce a novel semantic symbol vulnerability (SSV) metric and a semantic protection probability (SPP) to quantify the exposure of task-critical symbols to decoding errors, and prove that any Gray-coded constellation is strictly suboptimal in SCI-Weighted SSV whenever the source exhibits non-uniform semantic importance and co-occurrence statistics. Simulation results demonstrate that the proposed constellation achieves near 100% SPP across modulation orders from 4-QAM to 1024-QAM versus 50% for standard constellations at high spectral efficiency, a 21:1 compression ratio with semantic quality above 0.9, generalizing across MNIST, Fashion-MNIST, and FSDD without modification.
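作为对照,论文所偏离的基线——标准 Gray 编码 16-QAM——可以几行代码构造:每个 I/Q 轴用 2 比特 Gray 码,相邻星座点恰好只差 1 比特,从而最小化平均 BER,但对符号的语义重要性一视同仁(示意代码,幅度等级取常见的 ±1、±3):

```python
# 背景示意:标准 Gray 编码 16-QAM 星座,相邻点的比特标签只差 1 比特。

def gray(n):
    return n ^ (n >> 1)

def hamming(a, b):
    return bin(a ^ b).count("1")

levels = [-3, -1, 1, 3]                     # 每轴 4 个幅度等级
constellation = {}
for i in range(4):
    for q in range(4):
        bits = (gray(i) << 2) | gray(q)     # I 轴 2 比特 + Q 轴 2 比特
        constellation[bits] = (levels[i], levels[q])

# 沿 I 轴水平相邻的 4 个点,其比特标签两两只差 1 比特(Gray 性质)。
labels = [(gray(i) << 2) | gray(0) for i in range(4)]
```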

[AI-17] Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations

链接: https://arxiv.org/abs/2605.14937
作者: Jonathan Spieler,Angel Villar-Corrales,Sven Behnke
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Predictive world models enable agents to model scene dynamics and reason about the consequences of their actions. Inspired by human perception, object-centric world models capture scene dynamics using object-level representations, which can be used for downstream applications such as action planning. However, most object-centric world models and reinforcement learning (RL) approaches learn reactive policies that are fixed at inference time, limiting generalization to novel situations. We propose Slot-MPC, an object-centric world modeling framework that enables planning through Model Predictive Control (MPC). Slot-MPC leverages vision encoders to learn slot-based representations, which encode individual objects in the scene, and uses these structured representations to learn an action-conditioned object-centric dynamics model. At inference time, the learned dynamics model enables action planning via MPC, allowing agents to adapt to previously unseen situations. Since the learned world model is differentiable, we can use gradient-based MPC to directly optimize actions, which is computationally more efficient than relying on gradient-free, sampling-based MPC methods. Experiments on simulated robotic manipulation tasks show that Slot-MPC improves both task performance and planning efficiency compared to non-object-centric world model baselines. In the considered offline setting with limited state-action coverage, we find that gradient-based MPC performs better than gradient-free, sampling-based MPC. Our results demonstrate that explicitly structured, object-centric representations provide a strong inductive bias for controllable and generalizable decision-making. Code and additional results are available at this https URL.
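摘要倡导的梯度式 MPC 的核心流程是:通过可微世界模型对动作序列整体做梯度下降,执行首个动作后滚动重规划。下面用一维线性玩具动力学演示该流程(动力学、代价与步长均为示意假设,并非论文的槽位动力学模型):

```python
# 示意:梯度式 MPC —— 对动作序列做梯度下降,执行首个动作后重规划。
# 一维线性动力学与二次代价均为演示假设。

def plan(state, goal, horizon=5, lr=0.05, iters=100):
    """对动作序列整体做梯度下降(rollout 可微,梯度可解析写出)。"""
    actions = [0.0] * horizon
    for _ in range(iters):
        # 前向 rollout:s_{t+1} = s_t + a_t,故 s_H = s_0 + sum(a)
        final = state + sum(actions)
        # 代价:终态到目标的平方距离;d cost / d a_t = 2 * (s_H - goal)
        g = 2 * (final - goal)
        actions = [a - lr * g for a in actions]
    return actions

state, goal = 0.0, 4.0
actions = plan(state, goal)
# 滚动时域:只执行首个动作,观测新状态后重新规划。
next_state = state + actions[0]
```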

[AI-18] KGPFN: Unlocking the Potential of Knowledge Graph Foundation Model via In-Context Learning

链接: https://arxiv.org/abs/2605.14907
作者: Yisen Gao,Jiaxin Bai,Haoyu Huang,Zhongwei Xie,Yufei Li,Hong Ting Tsang,Sirui Han,Yangqiu Song
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Knowledge graph (KG) foundation models aim to generalize across graphs with unseen entities and relations by learning transferable relational structure. However, most existing methods primarily emphasize relation-level universality, while in-context learning, the other pillar of foundation models, remains under-explored for KG reasoning. In KGs, context is inherently structured and heterogeneous: effective prediction requires conditioning on the local context around the query entities as well as the global context that summarizes how a relation behaves across many instances. We propose KGPFN, a KG foundation model using Prior-data Fitted Network that unifies transferable relational regularities with inference-time in-context learning from structured context. KGPFN first learns relation representations via message passing on relation graphs to capture cross-graph relational invariances. For query-specific reasoning, it encodes local neighborhoods using a multi-layer NBFNet as local context. To enable ICL at global scale, it constructs relation-specific global context by retrieving a large set of instances of the query relation together with their local neighborhoods, and aggregates them within a Prior-Data Fitted Network framework that combines feature-level and sample-level attention. Through multi-graph pretraining on diverse KGs, KGPFN learns when to instantiate reusable patterns and when to override them using contextual evidence. Experiments on 57 KG benchmarks demonstrate that KGPFN achieves strong adaptation to previously unseen graphs through in-context learning alone, consistently outperforming competitive fine-tuned KG foundation models. Our code is available at this https URL.

[AI-19] COREKG: Coreset-Guided Personalized Summarization of Knowledge Graphs IJCAI2026

链接: https://arxiv.org/abs/2605.14900
作者: Sohel Aman Khan,Raghava Mutharaju,Supratim Shit
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at IJCAI 2026

点击查看摘要

Abstract:Knowledge Graphs (KGs) are extensively used across different domains and in several applications. Often, these KGs are very large in size. Such KGs become unwieldy for tasks such as question answering and visualization. Summarization of KGs offers a viable alternative in such cases. Furthermore, personalized KG summarization is crucial in the current data-driven world as it captures the specific requirements of users based on their query patterns. Since it only maintains relevant information, the personalized summaries of KG are small, resulting in significantly smaller storage requirements and query runtime. In this work, we adapt the coreset theory to create personalized KG summaries. For a given dataset and a user-specific query workload, we present an approach that samples a relevant subset of triples using sensitivity-based importance sampling. We ensure that the subset approximates the characteristics of the full dataset with bounded approximation error. We define sensitivity scores that measure the importance of a triple with respect to a user’s query workload, which are then used by our coreset construction algorithm. We explicitly focus on personalized knowledge graph summarization by constructing summaries independently for each user based on their query behaviour. Our evaluation on Freebase, WikiData, and DBpedia shows that COREKG delivers higher query-answering accuracy and structural coverage than the state-of-the-art methods, such as GLIMPSE, PPR, iSummary, PEGASUS and APEX^2, while requiring only a tiny fraction of the original graph.
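
The sensitivity-based importance sampling the abstract describes can be illustrated with a minimal sketch. The scores, weights, and function names below are illustrative assumptions, not COREKG's actual implementation: each triple is sampled with probability proportional to its sensitivity, and kept triples carry an inverse-probability weight so that weighted coreset statistics remain unbiased estimates over the full graph.

```python
import bisect
import random

def coreset_sample(triples, sensitivities, m, seed=0):
    """Sample m triples with probability proportional to sensitivity.
    Each kept triple carries weight 1/(m * p_i), the standard
    importance-sampling correction for unbiased weighted estimates."""
    rng = random.Random(seed)
    total = sum(sensitivities)
    probs = [s / total for s in sensitivities]
    # cumulative distribution for inverse-transform sampling
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    coreset = []
    for _ in range(m):
        i = bisect.bisect_left(cdf, rng.random())
        i = min(i, len(cdf) - 1)  # guard against floating-point round-off
        coreset.append((triples[i], 1.0 / (m * probs[i])))
    return coreset

triples = [("A", "knows", "B"), ("B", "likes", "C"), ("C", "in", "D")]
sens = [5.0, 1.0, 1.0]  # e.g. how often each triple's pattern appears in the query workload
cs = coreset_sample(triples, sens, m=2)
```

High-sensitivity triples dominate the sample but receive smaller weights, so frequent and rare workload patterns both remain represented in expectation.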

[AI-20] Critic-Driven Voronoi-Quantization for Distilling Deep RL Policies to Explainable Models AAMAS2026

链接: https://arxiv.org/abs/2605.14897
作者: Senne Deproost,Denis Steckelmacher,Ann Nowé
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted for presentation at EXTRAAMAS 2026

点击查看摘要

Abstract:Despite many successful attempts at explaining Deep Reinforcement Learning policies using distillation, it remains difficult to balance the performance-interpretability trade-off and select a fitting surrogate model. In addition to this, traditional distillation only minimizes the distance between the behavior of the original and the surrogate policy while other RL-specific components such as action value are disregarded. To solve this, we introduce a new model-agnostic method called Critic-Driven Voronoi State Partitioning, which partitions a black box control policy into regions where a simple class of model can be optimized using gradient descent. By exploiting the critic value network of the original policy, we iteratively introduce new subpolicies in regions with insufficient value, standing in for a measure of policy complexity. The partitioning, a Voronoi quantizer, uses nearest neighbor lookups to assign a linear function to each point in the state space resulting in a cell-like diagram. We validate our approach on several well-known benchmarks and prove that the distilled policy approaches the original policy using a reasonably sized set of linear functions.

[AI-21] Beyond Individual Intelligence: Surveying Collaboration Failure Attribution and Self-Evolution in LLM-based Multi-Agent Systems

链接: https://arxiv.org/abs/2605.14892
作者: Shihao Qi,Jie Ma,Rui Xing,Wei Guo,Xiao Huang,Zhitao Gao,Jianhao Deng,Jun Liu,Lingling Zhang,Bifan Wei,Boqian Yang,Pinghui Wang,Jianwen Sun,Jing Tao,Yaqiang Wu,Hui Liu,Yu Yao,Tongliang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.

[AI-22] BiFedKD: Bidirectional Federated Knowledge Distillation Framework for Non-IID and Long-Tailed ECG Monitoring

链接: https://arxiv.org/abs/2605.14886
作者: Zixuan Shu,Tiancheng Cao,Hen-Wei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Electrocardiogram (ECG) monitoring in Internet of Medical Things (IoMT) networks is constrained by strict data-sharing regulations and privacy concerns. Federated learning (FL) enables collaborative learning by keeping raw ECG data on devices, but frequent transmissions of high-dimensional model updates incur heavy per-round traffic over bandwidth-limited links. To alleviate this bottleneck, federated distillation (FD) replaces parameter exchange with logit-based knowledge transfer. However, the performance of FD often degrades under the non-independent and identically distributed (non-IID) and long-tailed label distributions in ECG deployments. To address these challenges, we propose a bidirectional federated knowledge distillation (BiFedKD) framework that employs an aggregation-by-distillation pipeline with temperature scaling to produce a stable global distillation signal for cross-client alignment. Experiments on the MIT-BIH Arrhythmia dataset show that BiFedKD improves accuracy and Macro-F1 over the baseline by 3.52% and 9.93%, respectively. Moreover, to reach the same Macro-F1, BiFedKD reduces communication overhead by 40% and computation cost by 71.7% compared with the baseline.
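
The temperature-scaled distillation signal at the core of the pipeline can be sketched as follows. This is a generic knowledge-distillation sketch: the per-class logit averaging and the loss below are simplifications for illustration, not BiFedKD's exact aggregation-by-distillation rule.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: a higher T flattens the distribution,
    exposing the 'dark knowledge' carried by non-target classes."""
    z = [l / T for l in logits]
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl

# aggregation-by-distillation, simplified: average per-class client logits
# into one global distillation signal for cross-client alignment
client_logits = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
global_logits = [sum(col) / len(col) for col in zip(*client_logits)]
loss = distill_loss([1.0, 0.2, -0.8], global_logits, T=2.0)
```

Because only class-count-sized logit vectors cross the network instead of model parameters, this style of transfer is what yields the communication savings the abstract reports.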

[AI-23] REALM: Retrospective Encoder Alignment for LFP Modeling

链接: https://arxiv.org/abs/2605.14867
作者: Peicheng Wu,Zhenyu Bu,Runze Ma,Lin Du
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:

点击查看摘要

Abstract:Spike activity has been the dominant neural signal for behavior decoding due to its high spatial and temporal resolution. However, as brain-computer interfaces (BCIs) move toward high channel counts and wireless operation, the high sampling frequency of spike signals becomes a bottleneck due to high power and bandwidth requirements. Local field potentials (LFPs) represent a different spatial-temporal scale of brain activity compared to spikes, offering key advantages including improved long-term stability, reduced energy consumption, and lower bandwidth requirement. Despite these benefits, LFP-based decoding models typically show reduced accuracy and often rely on non-causal architectures that are unsuitable for real-time deployment. To address these challenges, we propose REALM: a retrospective distillation framework that enables causal LFP decoding. Inspired by offline-to-online distillation strategies in speech recognition, REALM transfers representational knowledge from a pretrained multi-session bidirectional LFP model to a causal version for real-time deployment. We first pretrain a bidirectional Mamba-2 teacher model using a masked autoencoding objective. We then distill this teacher model into a compact student model via a combined objective of representation alignment and task supervision. REALM consistently outperforms both causal and non-causal LFP-based SOTA methods for behavior decoding. Notably, our REALM improves decoding performance while achieving a 2× reduction in parameter count and a 10× reduction in training time. These results demonstrate that retrospective distillation effectively bridges the gap between offline and real-time neural decoding. REALM shows that LFP-only models can achieve competitive decoding performance without reliance on spike signals, offering a practical and scalable alternative for next-generation wireless implantable BCIs.

[AI-24] Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

链接: https://arxiv.org/abs/2605.14866
作者: Lingzhe Zhang,Tong Jia,Kangjin Wang,Chiming Duan,Minghua He,Rongqian Wang,Xi Peng,Meiling Wang,Gong Zhang,Renhai Chen,Ying Li
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.

[AI-25] Do Coding Agents Understand Least-Privilege Authorization?

链接: https://arxiv.org/abs/2605.14859
作者: Zheng Yan,Jingxiang Weng,Charles Chen,Dengyun Peng,Ethan Qin,Jiannan Guan,Jinhao Liu,Qiming Yu,Yixin Yuan,Fanqing Meng,Carl Che,Mengkang Hu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive data. To study whether current models can infer this boundary themselves, we first introduce permission-boundary inference, where a model maps a task instruction and terminal environment to a file-level read/write/execute policy, and AuthBench, a benchmark of 120 realistic terminal tasks with human-reviewed permission labels and executable validators for utility and attack success. Evaluation shows that authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive permissions. More inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-insufficient. This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones. We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and necessity. Across tested models, this decomposition improves sensitive-task success by up to 15.8% on tightness-biased models while reducing attack success across all evaluated models.
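
The sufficiency/tightness distinction the abstract draws can be made concrete with a small audit sketch. The policy representation and helper name here are assumptions for illustration: a policy is sufficient when it covers every permission the execution chain needs, and tight when it grants nothing beyond them.

```python
def audit_policy(granted, required):
    """Compare a granted file-permission policy against the permissions the
    task's execution chain actually needs. Policies map file path -> set of
    modes from {'r', 'w', 'x'}. Returns missing grants (sufficiency failures)
    and excess grants (tightness failures)."""
    missing, excess = {}, {}
    for path, need in required.items():
        lack = need - granted.get(path, set())
        if lack:
            missing[path] = lack
    for path, got in granted.items():
        extra = got - required.get(path, set())
        if extra:
            excess[path] = extra
    return missing, excess

granted = {"/project/src": {"r", "w"}, "/home/user/.ssh": {"r"}}
required = {"/project/src": {"r", "w"}, "/project/build.sh": {"r", "x"}}
missing, excess = audit_policy(granted, required)
# missing: build.sh needs read+execute; excess: .ssh read was never needed
```

A least-privilege policy is one where both returned dictionaries are empty; the abstract's point is that single-pass generation tends to fail on one side or the other.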

[AI-26] Exploitation of Hidden Context in Dynamic Movement Forecasting: A Neural Network Journey from Recurrent to Graph Neural Networks and General Purpose Transformers

链接: https://arxiv.org/abs/2605.14855
作者: Lukas Schelenz,Shobha Rajanna,Denis Gosalci,Lucas Heublein,Jonas Pirkl,Jonathan Ott,Felix Ott,Christopher Mutschler,Tobias Feigl
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 12 pages

点击查看摘要

Abstract:Forecasting within signal processing pipelines is crucial for mitigating delays, particularly in predicting the dynamic movements of objects such as NBA players. This task poses significant challenges due to the inherently interactive and unpredictable nature of sports, where abrupt changes in velocity and direction are prevalent. Traditional approaches, including (S)ARIMA(X), Kalman filters (KF), and Particle filters (PF), often struggle to model the non-linear dynamics present in such scenarios. Machine learning (ML) methods, such as long short-term memory (LSTM) networks, graph neural networks (GNNs), and Transformers, offer greater flexibility and accuracy but frequently fail to explicitly capture the interplay between temporal dependencies and contextual interactions, which are critical in chaotic sports environments. In this paper, we evaluate these models and assess their strengths and weaknesses. Experimental results reveal key performance trade-offs across input history length, generalizability, and the ability to incorporate contextual information. ML-based methods demonstrated substantial improvements over linear models across forecast horizons of up to 2s. Among the tested architectures, our hybrid LSTM augmented with contextual information achieved the lowest final displacement error (FDE) of 1.51m, outperforming temporal convolutional neural network (TCNN), graph attention network (GAT), and Transformers, while also requiring less data and training time compared to GAT and Transformers. Our findings indicate that no single architecture excels across all metrics, emphasizing the need for task-specific considerations in trajectory prediction for fast-paced, dynamic environments such as NBA gameplay.

[AI-27] XFP: Quality-Targeted Adaptive Codebook Quantization with Sparse Outlier Separation for LLM Inference

链接: https://arxiv.org/abs/2605.14844
作者: Thomas Witt
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages, 3 figures, 17 tables, 1 algorithm. Code: this https URL

点击查看摘要

Abstract:We introduce XFP, a dynamic weight quantizer for LLM inference that inverts the conventional workflow: the operator specifies reconstruction quality floors on per-channel cosine similarity (one strict floor for attention and shared experts, one lazy floor for routed-expert MoE); XFP determines codebook size, outlier budget, and packing per layer automatically – no Hessian, no calibration data, no manual bit-width selection. Each weight matrix is decomposed into a sparse fp16 outlier residual and a dense sub-byte index tensor into a per-group learned codebook. Two storage modes share one auto-select frontend and one fused decode kernel: V2 (per-channel Lloyd) and V2a (shared library of L=32 codebooks per layer). On Qwen3.5-122B-A10B under V2, XFP reaches 138 tok/s single-stream decode on workstation hardware (RTX PRO 6000 Blackwell, TP=2) at 94.49% GSM8K strict-match (3 seeds, n=3957), and is 49% faster than Marlin INT4 at TP=1. For models that do not fit in the target memory envelope, we present the H-Process: a quality-driven iteration over the two cosine thresholds that finds the operating point at which the model just fits while still producing sensible output. Three constraints define its search space: the operator-set thresholds, an OOM boundary at quantize-on-load, and a garbage boundary in generation (cosine similarity steers; benches verify). On Qwen3.5-397B-A17B (512 routed experts/layer), the H-Process fits the full expert population into 2x96 GB at ~3.4 effective bits and delivers 100.9 tok/s long-output decode at 66.72% GSM8K strict-match on the full 1319-problem set (single seed at submission; multi-seed evaluation in progress), exceeding INT4 with routed-expert pruning on memory, throughput, and accuracy simultaneously.
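
The decomposition into a dense codebook-index tensor plus a sparse full-precision outlier residual can be sketched in one dimension. The names and the tiny codebook below are illustrative assumptions; XFP's per-group learned codebooks, fp16 packing, and fused decode kernels are far more involved.

```python
def quantize_with_outliers(weights, codebook, n_outliers):
    """Split weights into a sparse outlier residual (kept at full precision)
    and dense indices into a shared codebook. Returns (indices, outliers),
    where outliers maps position -> original value."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    outlier_pos = set(order[:n_outliers])
    outliers = {i: weights[i] for i in outlier_pos}
    indices = []
    for i, w in enumerate(weights):
        if i in outlier_pos:
            indices.append(0)  # placeholder; overwritten on decode
        else:
            indices.append(min(range(len(codebook)), key=lambda k: abs(codebook[k] - w)))
    return indices, outliers

def dequantize(indices, outliers, codebook):
    out = [codebook[k] for k in indices]
    for i, v in outliers.items():  # restore full-precision outliers
        out[i] = v
    return out

w = [0.11, -0.02, 3.5, 0.09, -0.1]  # 3.5 is the outlier
cb = [-0.1, 0.0, 0.1]
idx, outs = quantize_with_outliers(w, cb, n_outliers=1)
recon = dequantize(idx, outs, cb)
```

Separating outliers lets the codebook stay small and dense for the well-behaved bulk of the weights, which is what makes the sub-byte effective bit-widths reported above achievable without wrecking reconstruction quality.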

[AI-28] GPart: End-to-End Isometric Fine-Tuning via Global Parameter Partitioning

链接: https://arxiv.org/abs/2605.14841
作者: Paolo Mandica,Michał Brzozowski,Zuzanna Dubanowska,Neo Christopher Chung
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Low-rank adaptation (LoRA) has become the dominant paradigm for parameter-efficient fine-tuning (PEFT) of large language models (LLMs). However, its bilinear structure introduces a critical limitation: the mapping from trainable parameters to weight updates is not distance-preserving, distorting the optimization landscape. Methods that project a low-dimensional vector into LoRA’s parameter space, such as Uni-LoRA, improve parameter efficiency, but the subsequent bilinear LoRA map breaks end-to-end isometry, leaving the core distance-preservation problem unresolved. We propose GPart (Global Partition fine-tuning), a highly parameter-efficient fine-tuning method which removes the low-rank bottleneck entirely. Our method uses a single isometric partition matrix to map a d-dimensional trainable vector directly into the full weight space of the model. The result is an extremely minimal fine-tuning pipeline: one random projection, end-to-end isometric, with a single clean hyperparameter (d) and storage cost of d+1 values (the trainable vector plus a random seed). GPart builds on the theoretical premise that effective fine-tuning can emerge from random low-dimensional subspaces of the full weight space, without imposing low-rank matrix structure. We empirically demonstrate the superior or comparable performance of GPart to existing PEFT methods on natural language understanding, computer vision tasks, and mathematical reasoning. Overall, GPart achieves state-of-the-art efficiency and performance by removing structural constraints, offering a straightforward and elegant path to PEFT.
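
The core construction, one random-partition map from a d-dimensional trainable vector into the full weight space, can be sketched as follows. The function names and the exact column scaling are illustrative assumptions; the point of the sketch is why such a map is an isometry.

```python
import math
import random

def make_partition(n_weights, d, seed=0):
    """Randomly assign each of the n_weights full-model coordinates to one
    of d groups. With each group's entries scaled by 1/sqrt(|group|), the
    induced linear map R^d -> R^n has orthonormal columns (an isometry)."""
    rng = random.Random(seed)
    return [rng.randrange(d) for _ in range(n_weights)]

def expand(theta, partition, d):
    """Map the d trainable parameters to a full weight update: coordinate i
    receives theta[g]/sqrt(|group g|) where g = partition[i]."""
    counts = [partition.count(g) for g in range(d)]
    return [theta[g] / math.sqrt(counts[g]) for g in partition]

partition = make_partition(n_weights=8, d=3, seed=1)
theta = [0.5, -1.0, 2.0]
delta_w = expand(theta, partition, d=3)

# each non-empty group g contributes exactly theta[g]^2 to ||delta_w||^2,
# so the map preserves norms whenever every group is used
norm_w = math.sqrt(sum(x * x for x in delta_w))
```

Storing only the seed and the d-vector is what gives the d+1-value storage cost the abstract claims: the partition is fully reconstructible from the seed.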

[AI-29] Interestingness as an Inductive Heuristic for Future Compression Progress

链接: https://arxiv.org/abs/2605.14831
作者: Vincent Herrmann,Jürgen Schmidhuber
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:One of the bottlenecks on the way towards recursively self-improving systems is the challenge of interestingness: the ability to prospectively identify which tasks or data hold the potential for future progress. We formalize interestingness as an inductive heuristic for future compression progress and investigate its predictability using tools from Kolmogorov Complexity and Algorithmic Statistics. By analyzing complexity-runtime profiles under Length, Algorithmic, and Speed priors, we demonstrate that the inductive property of interestingness – the capacity for past progress to signal future discovery – is theoretically viable and empirically supported. We prove that expected future progress depends exponentially on the recency of the last observed breakthrough. Furthermore, we show that the Algorithmic Prior is significantly more optimistic than the Length Prior, yielding a quadratic increase in expected discovery for the same observed profile. These findings are experimentally confirmed across three diverse universal computational paradigms.

[AI-30] A Heterogeneous Temporal Memory Governance Framework for Long-Term LLM Persona Consistency

链接: https://arxiv.org/abs/2605.14802
作者: Zhao Yang,Wang Huan,Li Yingshuo,Tu Haomiao,Lin Hujite
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 5 figures, 2 tables. Preprint version. Code for ARPM v4.0 is available at: this https URL

点击查看摘要

Abstract:Large language models often suffer from fact loss, timeline confusion, persona drift, and reduced stability during long-range interaction, especially under high-noise knowledge bases, context clearing, and cross-model transfer. To address these issues, we introduce ARPM, an external temporal memory governance framework for long-term dialogue. ARPM separates static knowledge memory from dynamic dialogue experience memory and combines vector retrieval, BM25, RRF fusion, dual-temporal reranking, chronological evidence reading, and a controlled analysis protocol for evidence verification and answer binding. Unlike approaches that encode persona consistency into model weights or rely only on long context, ARPM treats continuity as a traceable, auditable, and transferable governance problem. Using engineering logs, we conduct three experiments. First, in a 50-round question-answering setting, we compare signal-to-noise ratios of 1:5 and 1:200+, and distinguish CSV auto-judgment from manual review. Under 1:5, CSV recall accuracy is 54.0%, while manual review raises it to 100.0%. Under 1:200+, the values are 44.0% and 80.0%. These results show that automatic rules can underestimate recall after supporting evidence enters the prompt. Second, ablation results show that dialogue history retrieval is necessary for recent continuity: disabling it reduces strict accuracy from 100% to 66.7%, and disabling BM25 reduces it to 80.0%, indicating that pure semantic retrieval is insufficient for correction and tracing. Third, under a 5.1-million-character noise substrate, periodic context clearing, and multi-model handoff, ARPM maintains semantic continuity, boundary continuity, and persona consistency, while exposing limits caused by weak protocol compliance. These findings show that long-term persona consistency can be decomposed into governable components and evaluated in a white-box manner.
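
ARPM's retrieval stack combines vector search and BM25 via RRF (Reciprocal Rank Fusion), a standard rank-level fusion rule; a minimal sketch follows (the memory-entry ids and the constant k are illustrative, not taken from the paper).

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    for every document it returns; summing across lists rewards entries
    that multiple retrievers rank highly. k=60 is the commonly used
    default from the original RRF formulation."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["m7", "m2", "m9"]  # hypothetical memory-entry ids from vector search
bm25_hits = ["m2", "m5", "m7"]    # ids from lexical BM25 search
fused = rrf_fuse([vector_hits, bm25_hits])
```

Here "m2" wins because both retrievers rank it near the top, which matches the abstract's finding that pure semantic retrieval alone is insufficient and lexical signals must be fused in.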

[AI-31] Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation

链接: https://arxiv.org/abs/2605.14774
作者: Lata B T,Savitha N J
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the era of AI and advanced technologies, identifying a crime or its perpetrator remains a major problem for investigations. Conventional approaches to criminal investigation usually rely on limited data analysis, and finding an optimal and efficient method that effectively identifies criminals from complex datasets while minimising false positives and false negatives is considered a challenge. The main novelty of this work is an approach based on the deep learning algorithm Deep Deterministic Policy Gradient (DDPG). We train the DDPG model with a dataset of crime scene material, witness statements and suspect profiles. The algorithm uses features to maximise the likelihood of identifying the offender while minimising the impact of noise and irrelevant data. We demonstrate the efficacy of the proposed method, with DDPG identifying criminals at an accuracy of 95%, outperforming several existing methods.
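
One standard ingredient of the DDPG algorithm referenced here is the soft (Polyak) target-network update; a minimal sketch of that single component is below. Parameters are shown as flat lists, and this is textbook DDPG machinery, not the paper's full pipeline.

```python
def soft_update(target, source, tau=0.005):
    """Polyak averaging used by DDPG to keep slowly moving target networks
    for its actor and critic: theta_target <- tau*theta + (1-tau)*theta_target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

# after each gradient step, nudge the target toward the online network
online = [1.0, -2.0]
target = [0.0, 0.0]
target = soft_update(target, online, tau=0.1)
```

The small tau keeps the bootstrapped critic targets nearly stationary between updates, which is what stabilizes DDPG's off-policy training.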

[AI-32] Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

链接: https://arxiv.org/abs/2605.14773
作者: Suorong Yang,Hanqi Zhu,Hai Gan,Fangjian Su,Guang Li,Furao Shen,Soujanya Poria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
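
The alternation between low-ratio regularization phases and high-ratio recovery phases can be sketched with a toy schedule. The target, amplitude, and period values here are illustrative assumptions; PODS's actual schedule is a more refined, plug-and-play rule layered on top of an existing selection method.

```python
def pods_schedule(epoch, target=0.5, amplitude=0.3, period=2):
    """Oscillatory data-volume schedule (a sketch of the idea, not the
    paper's exact rule): alternate low-ratio phases, which amplify
    selection-induced regularization, with high-ratio phases, which
    restore data coverage, so the ratio averages to the target."""
    phase = (epoch // period) % 2  # 0 = low phase, 1 = high phase
    ratio = target - amplitude if phase == 0 else target + amplitude
    return max(0.0, min(1.0, ratio))

ratios = [pods_schedule(e) for e in range(8)]
# two-epoch low phases near 0.2 alternate with two-epoch high phases near 0.8
```

Because the schedule only decides how much data to select each epoch, it composes with any sample-scoring criterion that decides which data to select, which is the plug-and-play property the abstract emphasizes.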

[AI-33] MediaClaw: Multimodal Intelligent-Agent Platform Technical Report

链接: https://arxiv.org/abs/2605.14771
作者: Shaoan Zhao,Huanlin Gao,Qiang Hui,Ting Lu,Xueqiang Guo,Yantao Li,Xinpei Su,Fuyuan Shi,Chao Tan,Fang Zhao,Kai Wang,Shiguo Lian
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:MediaClaw is a multimodal agent platform built on the OpenClaw ecosystem. Its core design follows a three-layer architecture of unified abstraction, pluginized extension, and workflow orchestration. The system is intended to address practical deployment pain points in AIGC adoption, including fragmented capabilities, heterogeneous interfaces, disconnected production processes, and limited reuse of high-quality production workflows. MediaClaw abstracts full-category AIGC capabilities into a unified invocation model, uses plugins to support hot-pluggable capability expansion, and uses task-oriented Skills to turn complex production processes into reusable workflow assets. This report focuses on the architectural design philosophy of MediaClaw, the design logic of its core capability model, and the key engineering trade-offs in implementation. It aims to provide reusable practical reference for building multimodal capability platforms.

[AI-34] Compositional Sparsity as an Inductive Bias for Neural Architecture Design

链接: https://arxiv.org/abs/2605.14764
作者: Hongyu Lin,Antonio Briola,Yuanrong Wang,Tomaso Aste
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Identifying the structural priors that enable Deep Neural Networks (DNNs) to overcome the curse of dimensionality is a fundamental challenge in machine learning theory. Existing literature suggests that effective high-dimensional learning is driven by compositional sparsity, where target functions decompose into constituents supported on low-dimensional variable subsets. To investigate this hypothesis, we combine Information Filtering Networks (IFNs), which extract sparse dependency structures via constrained information maximisation, with Homological Neural Networks (HNNs), which map the inferred topology into fixed-wiring sparse neural graphs. We formalise the design principles underlying this construction and present an interpretable pipeline in which abstraction emerges through hierarchical composition. HNNs are orders of magnitude sparser than standard DNNs and require only minimal hyperparameter tuning. On synthetic tasks with known sparse hierarchies, HNNs recover the underlying compositional structure and remain stable in regimes where dense alternatives degrade as dimensionality increases. Across a broad suite of real-world datasets, HNNs consistently match or outperform dense baselines while using far fewer parameters, exhibiting lower variance and showing reduced sensitivity to hyperparameters.

[AI-35] Probabilistic Verification of Recurrent Neural Networks for Single and Multi-Agent Reinforcement Learning IJCAI

链接: https://arxiv.org/abs/2605.14758
作者: Luca Marzari,Enrico Marchesini
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI) 2026

点击查看摘要

Abstract:History-dependent policies induced by recurrent neural networks (RNNs) rely on latent hidden state dynamics, making verification in partially observable reinforcement learning (RL) challenging. Existing RNN verification tools typically rely on restrictive modeling assumptions or coarse over-approximations of the hidden state space, which can lead to overly conservative or inconclusive results. We propose RNN Probabilistic Verification (RNN-ProVe), a probabilistic framework that estimates the likelihood of undesired behaviors in RNN-based policies. RNN-ProVe uses policy-driven sampling to approximate the set of hidden states that are feasible under a trained policy, and derives statistical error bounds to produce bounded-error, high-confidence estimates of behavioral violations. Experiments on partially observable single-agent and cooperative multi-agent tasks show that RNN-ProVe yields more quantitative, feasibility-aware probabilistic guarantees than existing tools, while scaling to recurrent and multi-agent settings.
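
The flavor of a "bounded-error, high-confidence estimate" can be illustrated with a Hoeffding-style sketch over sampled rollouts; the paper's actual statistical bound and sampling scheme may differ, and the numbers below are made up for illustration.

```python
import math

def violation_estimate(outcomes, delta=0.01):
    """Estimate the probability of an undesired behavior from Bernoulli
    rollout outcomes (1 = violation observed), with a Hoeffding error
    bound that holds with probability at least 1 - delta:
        |p_hat - p| <= sqrt(ln(2/delta) / (2n))."""
    n = len(outcomes)
    p_hat = sum(outcomes) / n
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return p_hat, eps

# 1000 simulated rollouts under the trained policy, 37 of which violated
# the property being checked
outcomes = [1] * 37 + [0] * 963
p_hat, eps = violation_estimate(outcomes, delta=0.01)
```

The bound shrinks as 1/sqrt(n), so the number of policy-driven samples directly controls the width of the high-confidence interval around the violation probability.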

[AI-36] XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

链接: https://arxiv.org/abs/2605.14754
作者: Gong Zhiren,Tiantong Wu,Jiaming Zhang,Fuyao Zhang,Che Wang,Yurong Hao,Yikun Hou,Foo Ping,Yilei Zhao,Fei Huang,Chau Yuen,Wei Yang Bryan Lim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed for knowledge synthesis, yet their capacity for compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks primarily focus on single-turn restricted scenarios, failing to capture the capability boundaries exposed by real-world interactive scientific workflows. To address this, we introduce XDomainBench, a diagnostic benchmark for interactive interdisciplinary scientific reasoning. We formalize the composition order and mixture structure to enable systematic stress-testing from single-discipline to inter-disciplinary, comprising 8,598 interactive sessions across 20 domains and 4 task categories, with 8 realistic trajectory patterns covering difficulty and domain-mixture dynamics, simulating real AI4S scenarios. Large-scale evaluation of LLMs reveals a systematic reasoning collapse as composition order increases, stemming from two root causes: (i) direct difficulty increases induced by domain composition, and (ii) indirect interaction-amplified failures where trajectory patterns trigger error accumulation, reasoning breaks, and domain confusion, ultimately leading to session collapse.

[AI-37] Cognitive-Uncertainty Guided Knowledge Distillation for Accurate Classification of Student Misconceptions ACL2026

链接: https://arxiv.org/abs/2605.14752
作者: Qirui Liu,Hao Chen,Weijie Shi,Jiajie Xu,Jia Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ACL 2026 Findings. 10 pages, 5 figures, 19 tables

点击查看摘要

Abstract:Accurately identifying student misconceptions is crucial for personalized education but faces three challenges: (1) data scarcity with a long-tail distribution, where authentic student reasoning is difficult to synthesize; (2) fuzzy boundaries between error categories with high annotation noise; (3) a deployment paradox: large models overlook unconventional approaches due to pretraining bias and cannot be deployed on edge devices, while small models overfit to noise. Unlike traditional methods that increase diversity through large-scale data synthesis, we propose a two-stage knowledge distillation framework that mines high-value samples from existing data. The first stage performs standard distillation to transfer task capabilities. The second stage introduces a dual-layer marginal selection mechanism based on cognitive uncertainty, identifying four types of critical samples based on teacher model uncertainty and confidence differences. For different data subsets, we design a difficulty-adaptive mechanism to balance hard/soft label contributions, enabling student models to inherit inter-class relationships from teacher soft labels while distinguishing ambiguous error types. Experiments show that with augmented training on only 10.30% of filtered samples, we achieve MAP@3 of 0.9585 (+17.8%) on the MAP-Charting dataset, and using only a 4B parameter model, we attain 84.38% accuracy on cross-topic tests of middle school algebra misconception benchmarks, significantly outperforming the SOTA LLM (67.73%) and standard fine-tuned 72B models (81.25%). Our code is available at this https URL.
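The dual-layer marginal selection described above can be illustrated with a toy partition of samples by teacher entropy and top-2 confidence margin. The thresholds and quadrant names below are our own illustrative assumptions, not the paper's definitions of the four critical sample types:

```python
import math

def teacher_entropy(probs):
    """Shannon entropy of a teacher's predictive distribution (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def margin(probs):
    """Confidence difference between the top two predicted classes."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

def select_marginal(samples, h_thresh, m_thresh):
    """Partition samples by teacher uncertainty (entropy) and confidence margin.

    The four quadrants loosely mirror the idea of four critical sample types;
    high-entropy or low-margin samples sit near fuzzy category boundaries.
    """
    buckets = {"hi_H_lo_M": [], "hi_H_hi_M": [], "lo_H_lo_M": [], "lo_H_hi_M": []}
    for i, probs in enumerate(samples):
        h_key = "hi_H" if teacher_entropy(probs) >= h_thresh else "lo_H"
        m_key = "lo_M" if margin(probs) < m_thresh else "hi_M"
        buckets[f"{h_key}_{m_key}"].append(i)
    return buckets
```

A confidently classified sample (peaked distribution, wide margin) lands in `lo_H_hi_M` and can be down-weighted, while ambiguous ones are kept for the second distillation stage.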

[AI-38] EVA: Editing for Versatile Alignment against Jailbreaks

链接: https://arxiv.org/abs/2605.14750
作者: Yi Wang,Hongye Qiu,Yue Xu,Sibei Yang,Zhan Qin,Minlie Huang,Wenjie Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: IEEE TPAMI 2026

点击查看摘要

Abstract:Large Language Models (LLMs) and Vision Language Models (VLMs) have demonstrated impressive capabilities but remain vulnerable to jailbreaking attacks, where adversaries exploit textual or visual triggers to bypass safety guardrails. Recent defenses typically rely on safety fine-tuning or external filters to reduce the model’s likelihood of producing harmful content. While effective to some extent, these methods often incur significant computational overheads and suffer from the safety utility trade-off, degrading the model’s performance on benign tasks. To address these challenges, we propose EVA (Editing for Versatile Alignment against Jailbreaks), a novel framework that pioneers the application of direct model editing for safety alignment. EVA reframes safety alignment as a precise knowledge correction task. Instead of retraining massive parameters, EVA identifies and surgically edits specific neurons responsible for the model’s susceptibility to harmful instructions, while leaving the vast majority of the model unchanged. By localizing the updates, EVA effectively neutralizes harmful behaviors without compromising the model’s general reasoning capabilities. Extensive experiments demonstrate that EVA outperforms baselines in mitigating jailbreaks across both LLMs and VLMs, offering a precise and efficient solution for post-deployment safety alignment.

[AI-39] Addressing Terminal Constraints in Data-Driven Demand Response Scheduling

链接: https://arxiv.org/abs/2605.14741
作者: Maximilian Bloor,Martha White,Ehecatl Antonio del Rio Chanona,Calvin Tsay
机构: 未知
类目: Systems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注: Accepted to IFAC World Congress 2026

点击查看摘要

Abstract:Electrified chemical processes are incentivized by exposure to time-varying electricity markets to operate flexibly, but participating in demand response schemes can require satisfying terminal constraints over long horizons. Specifically, terminal constraints may be required when computing optimal schedules in order to preserve dynamic stability. Model-based optimization methods are computationally costly, and data-driven scheduling via reinforcement learning (RL) faces severe credit-assignment challenges. We integrate Goal-Space Planning (GSP) with Deep Deterministic Policy Gradient (DDPG), using learned temporally abstract models over discrete subgoals to propagate value across extended horizons. Using a simulated air separation benchmark, we demonstrate the proposed approach improves sample efficiency over standard DDPG while satisfying terminal storage constraints, mitigating myopic control behavior.

[AI-40] TAPIOCA: Why Task-Aware Pruning Improves OOD Model Capability

链接: https://arxiv.org/abs/2605.14738
作者: Krish Sharma,Omar Naim,Soumadeep Saha,Nicholas Asher
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model’s task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.
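The layerwise norm and pairwise-distance profiles the analysis relies on can be computed in a few lines, assuming access to per-layer hidden vectors (a hypothetical representation format, not the paper's exact extraction pipeline):

```python
import math

def layer_profiles(hidden_states):
    """Per-layer representation statistics for comparing ID vs OOD geometry.

    hidden_states: list of layers; each layer is a list of hidden vectors
    (lists of floats) for a batch of inputs.
    Returns (mean L2 norm per layer, mean pairwise distance per layer).
    """
    norms, pair_dists = [], []
    for layer in hidden_states:
        # average vector norm at this layer
        norms.append(sum(math.sqrt(sum(x * x for x in v)) for v in layer) / len(layer))
        # average distance between all pairs of vectors at this layer
        dists = [
            math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
            for i, u in enumerate(layer)
            for v in layer[i + 1:]
        ]
        pair_dists.append(sum(dists) / len(dists))
    return norms, pair_dists
```

Comparing these profiles between ID and OOD inputs would flag the layers where the OOD curve deviates, the candidates that task-aware pruning removes.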

[AI-41] On Strong Equivalence Notions in Logic Programming and Abstract Argumentation

链接: https://arxiv.org/abs/2605.14721
作者: Giovanni Buraglio,Wolfgang Dvorak,Stefan Woltran
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Strong equivalence between knowledge bases ensures the possibility of replacing one with the other without affecting reasoning outcomes, in any given context. This makes it a crucial property in nonmonotonic formalisms. In particular, the fields of logic programming and abstract argumentation provide primary examples in which this property has been subject to vast investigations. However, while (classes of) logic programs and abstract argumentation frameworks are known to be semantically equivalent in static settings, this alignment breaks in dynamic contexts due to differing notions of update. As a result, strong equivalence does not always carry over from one formalism to the other. In this paper, we carefully investigate this discrepancy and introduce a new notion of strong equivalence for logic programs. Our approach preserves strong equivalence under translation between certain classes of logic programs and both Dung-style and claim-augmented argumentation frameworks, thus restoring compatibility across these formalisms.

[AI-42] NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces

链接: https://arxiv.org/abs/2605.14698
作者: Konstantinos Kontras,Trui Osselaer,Stylianos G. Mouslech,Angeliki-Ilektra Karaiskou,Guido Gagliardi,Thomas Strypsteen,Mohammad Hossein Badiei,Anku Rani,Maarten Vanmarcke,Miguel Bhagubai,Chanakya Ekbote,Jaedong Hwang,Christos Chatzichristos,Paul Pu Liang,Maarten De Vos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the EEG-specific preprocessing that might influence reported results, and in the reported metrics, frequently obscuring the clinical relevance in EEG. We introduce NeuroAtlas, the largest EEG benchmark to date: 42 datasets and 260k hours covering clinical EEG (epilepsy, sleep medicine, brain age estimation) and brain-computer interfaces, and include multiple datasets per task along with bespoke clinical evaluation metrics. Besides evaluating EEG-FMs with respect to supervised baselines, we present results from generic time-series FMs. We report three findings. First, EEG-specific FMs do not consistently outperform time-series FMs, which have neither EEG-focused architectures nor been pretrained on EEG. Second, standard machine learning metrics are insufficient to assess clinical utility: thus, we thoroughly evaluate more appropriate measures such as the quality of event-level decision-making, hypnogram-derived features, and the brain-age gap in the domains of epilepsy, sleep, and brain age, respectively. Third, model rankings and performance can vary substantially within domains. We conclude that pretrained models perform largely on par, with only narrow advantages for a few, and that current models do not yet deliver on the promise of an out-of-the-box unified EEG model. NeuroAtlas exposes this gap and provides the datasets and metrics for the next generation of unified EEG FMs.

[AI-43] Spontaneous symmetry breaking and Goldstone modes for deep information propagation

链接: https://arxiv.org/abs/2605.14685
作者: Nabil Iqbal,T. Anderson Keller,Yue Song,Takeru Miyato,Max Welling
机构: 未知
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
备注: 28 pages. Code at this https URL

点击查看摘要

Abstract:In physical systems, whenever a continuous symmetry is spontaneously broken, the system possesses excitations called Goldstone modes, which allow coherent information propagation over long distances and times. In this work, we study deep neural networks whose internal layers are equivariant under a continuous symmetry and may therefore support analogous Goldstone-like degrees of freedom. We demonstrate, both analytically and empirically, that these degrees of freedom enable coherent signal propagation across depth and recurrent iterations, providing a mechanism for stable information flow without relying on architectural stabilizers such as residual connections or normalization. In feedforward networks, this results in improved trainability and representational diversity across layers. In recurrent settings, we demonstrate the same mechanism is valuable for long-term memory by propagating information over recurrent iterations, thereby improving performance of RNNs and GRUs on long-sequence modeling tasks.

[AI-44] π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

链接: https://arxiv.org/abs/2605.14678
作者: Haoran Zhang,Luxin Xu,Zhilin Wang,Runquan Gui,Shunkai Zhang,Haodi Lei,Zihao He,Bingsu He,Chicheng Qin,Tong Zhu,Xiaoye Qu,Yang Yang,Yu Cheng,Yafu Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 44 pages

点击查看摘要

Abstract:The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce π-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, π-Bench evaluates agents’ ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show (1) proactive assistance remains challenging, (2) a clear distinction between task completion and proactivity, and (3) the value of prior interaction for proactive intent resolution in later tasks.

[AI-45] How Sensitive Are Radiomic AI Models to Acquisition Parameters?

链接: https://arxiv.org/abs/2605.14667
作者: D. Gil,I. Sanchez,C. Sanchez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:A main barrier to the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying the scan-parameter sensitivity of radiomic AI models, while identifying clinically significant parameter regions associated with improved cross-dataset robustness. We formulate a mixed-effects framework for quantifying the influence that clinically relevant acquisition parameters have on model performance, while accounting for subject-level random effects. We have applied our framework to lung cancer diagnosis in CT scans using two independent multicentre datasets (a public database and own-collected data) and several SoA architectures. To evaluate across-database reproducibility, CT parameters have been adjusted using the data collected and tested on the public set. The optimal configuration selected is X-ray tube current = 200 mA, spiral pitch = 1.5, slice thickness = 1.25 mm, which balances diagnostic quality with low radiation dose. This configuration pushes metrics from 0.79±0.04 sensitivity and 0.47±0.10 specificity on low-quality scans to 0.90±0.10 sensitivity and 0.79±0.13 specificity on high-quality ones.

[AI-46] Monitoring Data-aware Temporal Properties (Extended Version) IJCAI2026

链接: https://arxiv.org/abs/2605.14666
作者: Alessandro Gianola,Marco Montali,Sarah Winkler
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: This is the extended version of a paper accepted to IJCAI 2026

点击查看摘要

Abstract:Dynamic systems in AI are often complex and heterogeneous, so that an internal specification is not accessible and verification techniques such as model checking are not applicable. Monitoring is in such cases an attractive alternative, as it evaluates desirable properties along traces generated by an unknown dynamic system. In this work, we consider anticipatory monitoring of linear-time properties enriched with an arbitrary SMT theory over finite traces (LTLfMT). Anticipatory monitoring in this setting is highly challenging, as the monitoring state depends on both the trace prefix seen so far and all its possible finite continuations. Under reasonable assumptions on the background theory, we present and formally prove the correctness of a novel foundational framework for monitoring properties in an expressive fragment of LTLfMT. The framework combines automata-theoretic methods to handle the temporal aspects of the logic, with automated reasoning techniques to address the first-order dimension. Moreover, we identify for the first time decidable fragments of this monitoring problem that are practically relevant as they combine linear arithmetic with uninterpreted functions, which covers e.g. data-aware business processes and dynamic systems operating over a read-only database. Feasibility is witnessed by a prototype implementation and preliminary evaluation.

[AI-47] MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder

链接: https://arxiv.org/abs/2605.14660
作者: Eranga Bandara,Ross Gore,Asanga Gunaratna,Ravi Mukkamala,Nihal Siriwardanagea,Sachini Rajapakse,Isurunima Kularathna,Pramoda Karunarathna,Wathsala Herath,Chalani Rajapakse,Sachin Shetty,Anita H. Clayton,Christopher K. Rhea,Ng Wee Keong,Kasun De Zoysa,Amin Hass,Shaifali Kaushik,Preston Samuel,Atmaram Yarlagadda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing therapeutic approaches (prolonged exposure, EMDR, cognitive behavioural therapy) operate predominantly downstream of the reactive cascade, teaching patients to tolerate or reframe distress after it has arisen. While clinically valuable, these suppression-based approaches do not produce the upstream pathway dissolution that constitutes lasting structural neural reorganisation. This paper proposes MindGap, a privacy-preserving on-device conversational AI framework that delivers structured neuroplastic rehabilitation for PTSD through the practice of dependent origination, a Buddhist psychological framework that identifies the precise moment between the pre-cognitive affective signal and the reactive elaboration that follows as the site of therapeutic intervention. MindGap guides patients through three progressive layers of observation at this feeling-tone gap: noticing the bare affective signal before reactive elaboration, recognising it as self-arising rather than caused by the stimulus, and recognising the conditioned implicit belief beneath the feeling. Each layer corresponds to progressively deeper prefrontal regulatory engagement and progressively deeper long-term depression-mediated weakening of the reactive pathway, producing genuine upstream dissolution rather than downstream suppression. Running entirely on-device with no data egress, MindGap delivers daily calibrated exposure sessions through a fine-tuned lightweight large language model, making it deployable in sensitive clinical and military contexts where cloud-based solutions are not permitted.

[AI-48] Teaching Large Language Models When Not to Know: Learning Temporal Critique for Ex-Ante Reasoning

链接: https://arxiv.org/abs/2605.14636
作者: Chenlu Ding,Jiancan Wu,Yanchen Luo,Zheyuan Liu,Yancheng Yuan,Xiang Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) often fail to reason under temporal cutoffs: when prompted to answer from the standpoint of an earlier time, they exploit knowledge that became available only later. We study this failure through the lens of ex-ante reasoning, where a model must rely exclusively on information knowable before a cutoff. Through a systematic analysis of prompt-level interventions, we find that temporal leakage is highly sensitive to cutoff formulation and instruction placement: explicit cutoff statements outperform implicit historical framings, and prefix constraints reduce leakage more effectively than suffix constraints. These findings indicate that prompting can steer models into a temporal frame, but does not endow them with the ability to verify whether a response is temporally admissible. We further argue that supervised fine-tuning is insufficient, since ex-ante correctness is not an intrinsic property of an answer, but a relation between the answer and the cutoff. To address this gap, we propose TCFT, a Temporal Critique Fine-Tuning framework that trains models to acquire cutoff-aware temporal verification. Given a query, a cutoff, and a candidate response, TCFT teaches the model to identify post-cutoff leakage, explain temporal boundary violations, and judge temporal admissibility. Experiments with Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct show that TCFT consistently outperforms prompting and SFT baselines, reducing average leakage by 41.89 and 37.79 percentage points, respectively.

[AI-49] An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

链接: https://arxiv.org/abs/2605.14624
作者: Sohaib Afifi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 16 pages, 5 figures, 3 tables. v0.1: framework + measurement protocol instantiated at n=20; empirical extension to larger problem sizes deferred to v0.2

点击查看摘要

Abstract:A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from “training is expensive” to “neural solvers are net-inefficient”, which is where the critique actually goes wrong. Training the network costs a large fixed amount of GPU energy; running the metaheuristic costs a small amount of CPU energy on every instance, repeated as long as the solver is deployed. The two are not commensurable until a deployment volume is fixed. We define the Amortized Efficiency Threshold (AET) as the deployment volume above which a neural solver breaks even with a heuristic baseline in total energy or carbon, under an explicit constraint on solution quality. We show that the cumulative-energy ratio between the two solvers tends to a constant strictly below one whenever the network wins per-instance, and that this limit does not depend on how the training cost was measured. An embodied-carbon term amortizes hardware fabrication symmetrically on both sides. We instantiate the framework on the Multi-Task VRP (MTVRP) environment at n=20 customers across 19 problem variants and five training seeds, with HGS via PyVRP as the heuristic baseline. The measured crossover sits near 1.58 \times 10^5 deployed instances; the per-instance ratio is 0.41, reflecting the moderate size of the instances tested. The contribution is the framework, the open instrumentation, and the measurement protocol; structural convergence of the ratio at larger problem sizes is left to future empirical work.
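The break-even arithmetic behind the AET reduces to solving E_train + n * e_neural = n * e_heuristic for the deployment volume n. A minimal sketch follows; the energy figures passed in are illustrative placeholders, not measurements from the paper:

```python
def amortized_efficiency_threshold(e_train, e_neural, e_heuristic):
    """Deployment volume above which the neural solver's total energy
    drops below the heuristic's.

    e_train:     fixed training energy of the neural solver
    e_neural:    per-instance inference energy of the neural solver
    e_heuristic: per-instance energy of the heuristic baseline
    (all in the same unit, e.g. joules)
    """
    if e_neural >= e_heuristic:
        # the neural solver loses per-instance, so it never amortizes
        return float("inf")
    return e_train / (e_heuristic - e_neural)

# Illustrative call using the per-instance ratio 0.41 from the abstract,
# with a made-up training budget of 1e6 energy units.
n_star = amortized_efficiency_threshold(e_train=1.0e6, e_neural=0.41, e_heuristic=1.0)
```

Past the threshold, the cumulative-energy ratio tends to e_neural / e_heuristic regardless of how the training cost was measured, which is the abstract's limit argument.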

[AI-50] SliceGraph: Mapping Process Isomers in Multi-Run Chain-of-Thought Reasoning

链接: https://arxiv.org/abs/2605.14619
作者: Kang Chen,Junjie Nian,Yixin Cao,Yugang Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-run chain-of-thought reasoning is usually collapsed to final-answer aggregates, which discard how sampled trajectories share, split, and rejoin through intermediate computation. We propose SliceGraph, a post-hoc problem-model-cell graph built by mutual-kNN over sparse activation-key Jaccard similarity between CoT slices, and treat it as a measurement object for process geometry rather than as a decoding program. Across sampled CoT ensembles from three primary 4B/8B models on math and science benchmarks, blinded annotation supports SliceGraph biconnected components as shared reasoning-state units and process families as within-family strategy-coherent route units. In 85.5% of 954 problem-model cells, correct CoTs sharing the same normalized answer split into multiple process families; among cells with at least two such runs, 76.6% of run pairs are cross-family on average. We call such same-answer, family-divergent correct trajectories process isomers. A label-seeded reward field provides a separate value-landscape layer: success-associated regions often split into disconnected high-value cores, and route families specialize over these core footprints rather than merely duplicating one another. A typed-state transition analysis further shows that process families navigate the same atlas with distinct transition kernels under matched null controls. Representation ablations, a cross-architecture replication, and two cross-scale replications support the robustness of the route-family scaffold, showing that final-answer aggregation overlooks this structured multi-route process geometry.
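The graph-construction step, mutual-kNN over Jaccard similarity between sparse activation-key sets, can be sketched directly. The toy slices below are illustrative stand-ins for real CoT slice signatures:

```python
def jaccard(a, b):
    """Jaccard similarity between two sparse activation-key sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mutual_knn_edges(slices, k):
    """Keep an edge (i, j) only when each slice is in the other's top-k.

    slices: list of sets of activation keys, one per CoT slice.
    Mutual-kNN prunes asymmetric similarities, leaving a sparser graph
    whose biconnected components can then be read off as shared states.
    """
    n = len(slices)
    neighbours = []
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: jaccard(slices[i], slices[j]),
            reverse=True,
        )
        neighbours.append(set(ranked[:k]))
    return {(i, j) for i in range(n) for j in neighbours[i]
            if i < j and i in neighbours[j]}
```

On the resulting undirected graph, standard biconnected-component and community algorithms would recover the reasoning-state units and process families described above.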

[AI-51] In-IDE Toolkit for Developers of AI-Based Features ICSE’26

链接: https://arxiv.org/abs/2605.14612
作者: Yaroslav Sokolov,Yury Khudyakov,Lenar Sharipov,Andrei Gasparian,Parth Tiwary,Artem Trofimov
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Published at IDE’26 co-located with ICSE’26

点击查看摘要

Abstract:AI-enabled features built on LLMs and agentic workflows are difficult to test, debug, and reproduce, especially for product-focused software engineers without a machine learning background. We present the AI Toolkit plugin for JetBrains IDEs, which brings tracing and evaluation directly into the Run/Debug loop. A mixed-methods study with practitioners surfaced three consistent needs: (1) make evaluation regular and repeatable, (2) expose traces at the moment of execution, and (3) minimize setup and context switching. Guided by these needs, the AI Toolkit introduces an IDE-native workflow: run-triggered trace capture; immediate, hierarchical inspection; one-click “Add to Dataset” from traces; and unit-test-like evaluations with pluggable metrics. The first release in PyCharm shows promising early signals - strong conversion when promoted at Run, sustained usage among those who capture traces, and low churn - suggesting that IDE-native observability lowers activation energy and helps developers adopt disciplined practices. We detail the design and implementation of the AI Agents Debugger and AI Evaluation, report initial adoption telemetry, and outline next steps to broaden framework coverage and scale evaluations. Together, these results indicate that integrating AI observability and evaluation into everyday IDE workflows can make modern AI development accessible to non-ML specialists while preserving software-engineering practices.

[AI-52] One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

链接: https://arxiv.org/abs/2605.14605
作者: Itay Zloczower,Eyal Lenga,Gilad Gressel,Yisroel Mirsky
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Model providers increasingly release open weights or allow users to fine-tune foundation models through APIs. Although these models are safety-aligned before release, their safeguards can often be removed by fine-tuning on harmful data. Recent defenses aim to make models robust to such malicious fine-tuning, but they are largely evaluated only against fixed attacks that do not account for the defense. We show that these robustness claims are incomplete. Surveying 15 recent defenses, we identify several defense mechanisms and show that they share a single weakness: they obscure or misdirect the path to harmful behavior without removing the behavior itself. We then develop a unified adaptive attack that breaks defenses across all defense mechanisms. Our results show that current approaches do not provide robust security; they mainly stop the attacks they were designed against. We hope that our unified adaptive adversary for this domain will help future researchers and practitioners stress-test new defenses before deployment.

[AI-53] Fast Rates for Inverse Reinforcement Learning

链接: https://arxiv.org/abs/2605.14599
作者: Andreas Schlaginhaufen,Maryam Kamgarpour
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We establish novel structural and statistical results for entropy-regularized min-max inverse reinforcement learning (Min-Max-IRL) with linear reward classes in finite-horizon MDPs with Borel state and action spaces. On the structural side, we show that maximum likelihood estimation (MLE) and Min-Max-IRL are equivalent at the population level, and at the empirical level under deterministic dynamics. On the statistical side, exploiting pseudo-self-concordance of the Min-Max-IRL loss, we prove that both the trajectory-level KL divergence and the squared parameter error in the Hessian norm decay at the fast rate \mathcal{O}(n^{-1}), where n is the number of expert trajectories. Our guarantees apply under misspecification and require no exploration assumptions. We further extend reward-identifiability results to general Borel spaces and derive novel results on the derivatives of the soft-optimal value function with respect to reward parameters.

[AI-54] Angel or Demon: Investigating the Plasticity Interventions Impact on Backdoor Threats in Deep Reinforcement Learning ICML2026

链接: https://arxiv.org/abs/2605.14587
作者: Oubo Ma,Ruixiao Lin,Yang Dai,Jiahao Chen,Chunyi Zhou,Linkang Du,Shouling Ji
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: To appear in the Forty-Third International Conference on Machine Learning (ICML 2026), July 6-11, 2026, Seoul, South Korea

点击查看摘要

Abstract:Extensive research has highlighted the severe threats posed by backdoor attacks to deep reinforcement learning (DRL). However, prior studies primarily focus on vanilla scenarios, while plasticity interventions have emerged as indispensable built-in components of modern DRL agents. Despite their effectiveness in mitigating plasticity loss, the impact of these interventions on DRL backdoor vulnerabilities remains underexplored, and this lack of systematic investigation poses risks in practical DRL deployments. To bridge this gap, we empirically study 14,664 cases integrating representative interventions and attack scenarios. We find that only one intervention (i.e., SAM) exacerbates backdoor threats, while other interventions mitigate them. Pathological analysis identifies that the exacerbation is attributed to backdoor gradient amplification, while the mitigation stems from activation pathway disruption and representation space compression. From these findings, we derive two novel insights: (1) a conceptual framework SCC for robust backdoor injection that deconstructs the mechanistic interplay between interventions and backdoors in DRL, and (2) abnormal loss landscape sharpness as a key indicator for DRL backdoor detection.

[AI-55] Prompt Segmentation and Annotation Optimisation: Controlling LLM Behaviour via Optimised Segment-Level Annotations

链接: https://arxiv.org/abs/2605.14561
作者: Devika Prasad,Luke Gerschwitz,Tong Li,Henry Xiao,Anjin Liu,Coco Wu,Anna Leontjeva,Luiz Pizzato
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Prompt engineering is crucial for effective interaction with generative artificial intelligence systems, yet existing optimisation methods often operate over an unstructured and vast prompt space, leading to high computational costs and potential distortions of the original intent. We introduce Prompt Segmentation and Annotation Optimisation (PSAO), a structured prompt optimisation framework designed to improve prompt optimisation controllability and efficiency. PSAO decomposes a prompt into interpretable segments (e.g., sentences) and augments each with human-readable annotations (e.g., not important, important, very important). These annotations guide large language models (LLMs) in allocating focus and clarifying confusion during response generation. We formally define the segmentations and annotations and demonstrate that optimised segment-level annotations can lead to improved LLM responses, with the original prompt retained as a candidate in the optimisation space to prevent performance degradation. Empirical evaluations indicate that PSAO benefits from annotations in terms of improved reasoning accuracy and self-consistency. However, developing efficient methods for identifying optimal segmentations and annotations remains challenging and is reserved for future investigation. This work is intended as a proof of concept, demonstrating the feasibility and potential of segment-level annotation optimisation.

[AI-56] PyCSP3-Scheduling: A Scheduling Extension for PyCSP3

链接: https://arxiv.org/abs/2605.14559
作者: Sohaib Afifi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:PyCSP^3 provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP^3, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP^3 already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP^3-Scheduling, a library that adds scheduling abstractions to PyCSP^3 through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP^3/XCSP^3 constraints, maintaining the modeling/solving separation that underpins the PyCSP^3 ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: this https URL
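The core compilation idea, an interval variable becoming a (start, duration) pair whose NoOverlap constraint decomposes into pairwise disjunctions over integers, can be sketched with a toy checker (our own illustration, not the library's API).

```python
def no_overlap(intervals):
    """intervals: list of (start, duration) pairs on one machine.
    True iff for every pair, one interval ends before the other starts."""
    return all(s1 + d1 <= s2 or s2 + d2 <= s1
               for i, (s1, d1) in enumerate(intervals)
               for (s2, d2) in intervals[i + 1:])

jobs = [(0, 3), (3, 2), (5, 4)]   # back-to-back schedule: feasible
clash = [(0, 3), (2, 2)]          # second job starts before the first ends
ok = no_overlap(jobs)
bad = no_overlap(clash)
```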

[AI-57] TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality

链接: https://arxiv.org/abs/2605.14556
作者: Zidong Liu,Rongkai Liu,Yue Li,Zhenliang Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures. Accepted as an IEEE VR 2026 Poster

点击查看摘要

Abstract:Symmetrical Reality (SR) is emerging as a future trend for human-agent coexistence, placing higher demands on agents to acquire human-like intelligence. It calls for richer and more diverse human guidance. We introduce a three-stage demonstration paradigm integrating multimodal demonstration signals. Building on this paradigm, we developed TeachAnything, a cloud-based, crowdsourcing-oriented demonstration platform with physics simulation capable of collecting diverse demonstration data across varied scenes, tasks, and embodiments. By unifying virtual and physical interactions through both methodological design and physics simulation, the system serves as a practical foundation for developing embodied agents aligned with Symmetrical Reality.

[AI-58] Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

链接: https://arxiv.org/abs/2605.14555
作者: Shuyang Cui,Zhi Zhong,Qiyu Wu,Zachary Novack,Woosung Choi,Keisuke Toyama,Kin Wai Cheuk,Junghyun Koo,Yukara Ikemiya,Christian Simon,Chihiro Nagashima,Shusuke Takahashi
机构: 未知
类目: Sound (cs.SD); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Current methods for creating drum loop audio in digital music production, such as using one-shot samples or resampling, often demand non-trivial efforts from creators. While recent generative models achieve high fidelity and adhere to text, they lack the specific control needed for such a task. Existing symbolic-to-audio research often focuses on single, tonal instruments, leaving the challenge of polyphonic, percussive drum synthesis unaddressed. We address this gap by introducing "Break-the-Beat!", a model capable of rendering a drum MIDI with the timbre of a reference audio. It is built by fine-tuning a pre-trained text-to-audio model with our proposed content encoder and an effective hybrid conditioning mechanism. To enable this, we construct a new dataset of paired target-reference drum audio from existing drum audio datasets. Experiments demonstrate that our model generates high-quality drum audio that follows high-resolution drum MIDI, achieving strong performance across metrics of audio quality, rhythmic alignment, and beat continuity. This offers producers a new, controllable tool for creative production. Demo page: this https URL

[AI-59] Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits ICLR2026

链接: https://arxiv.org/abs/2605.14553
作者: Donghao Li,Chengshuai Shi,Weijuan Ou,Cong Shen,Jing Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published as a conference paper at ICLR 2026

点击查看摘要

Abstract:Prompt engineering has become central to eliciting the capabilities of large language models (LLMs). At its core lies prompt selection – efficiently identifying the most effective prompts. However, most prior investigations overlook a key challenge: the inherently multi-faceted nature of prompt performance, which cannot be captured by a single metric. To fill this gap, we study the multi-objective prompt selection problem under two practical settings: Pareto prompt set recovery and best feasible prompt identification. Casting the problem into the pure-exploration bandits framework, we adapt provably efficient algorithms from multi-objective bandits and further introduce a novel design for best feasible arm identification in structured bandits, with theoretical guarantees on the identification error in the linear case. Extensive experiments across multiple LLMs show that the bandit-based approaches yield significant improvements over baselines, establishing a principled and efficient framework for multi-objective prompt optimization.
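Pareto prompt set recovery can be illustrated with a toy sketch (our own construction, not the paper's algorithm): estimate each prompt's mean reward vector from noisy pulls, then keep the empirically non-dominated prompts.

```python
import random

def dominates(a, b):
    """a dominates b iff a >= b on every objective and a > b on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_set(means):
    """Indices of arms whose estimated mean vector is not dominated by any other."""
    return [i for i, m in enumerate(means)
            if not any(dominates(means[j], m) for j in range(len(means)) if j != i)]

random.seed(0)
# Hypothetical 2-objective means per prompt (e.g., accuracy, self-consistency).
true_means = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4)]
est = [tuple(sum(m + random.gauss(0, 0.01) for _ in range(50)) / 50 for m in mu)
       for mu in true_means]
front = pareto_set(est)   # prompt 3 is dominated by prompt 1
```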

[AI-60] Complacent Not Sycophantic: Reframing Large Language Models and Designing AI Literacy for Complacent Machines

链接: https://arxiv.org/abs/2605.14544
作者: Federico Germani,Giovanni Spitale
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are often described as sycophantic, in the sense that they appear to flatter users or mirror their beliefs. We argue that this label is conceptually misleading: sycophancy implies motives and strategic intent, which LLMs do not possess. Their behaviour is better understood as complacency, a structural tendency to agree with user input because training data, reward signals and design favour agreement and reinforcement over correction. We argue that this distinction matters. Whether developers act sycophantically or not, models themselves never are sycophants; they can only be made more or less complacent. This reframing locates agency in developers and institutions, not in the model. Because complacent models reinforce users’ prior beliefs, we argue that AI literacy educational approaches should particularly focus on strategies to counter confirmation bias.

[AI-61] RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation

链接: https://arxiv.org/abs/2605.14543
作者: Shuhao Chen,Weisen Jiang,Changmiao Wang,Xiaoqing Wu,Xuanren Shi,Yu Zhang,James T. Kwok
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Inpatient medication recommendation requires clinicians to repeatedly select specific medications, doses, and routes as a patient’s condition evolves. Existing benchmarks formulate this task as admission-level prediction over coarse drug codes with multi-hot diagnostic and procedure code inputs, failing to capture the per-timepoint, information-rich nature of real prescribing. We propose RxEval, a prescription-level benchmark that evaluates LLM prescribing capability by multiple-choice questions: each question presents a detailed patient profile and time-ordered clinical trajectory, requiring selection of specific medication-dose-route triples from real prescriptions and patient-specific distractors generated via reasoning-chain perturbation. RxEval comprises 1,547 questions spanning 584 patients, 18 diagnostic categories, and 969 unique medications. Evaluation of 16 LLMs shows that RxEval is both challenging and discriminative: F1 ranges from 45.18 to 77.10 across models, and the best Exact Match is only 46.10%. Error analysis reveals that even frontier models may overlook stated patient information and fail to derive clinical conclusions.
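The prescription-level scoring can be sketched as set comparison over medication-dose-route triples; the metric definitions and drug names below are our own illustration, not the benchmark's exact code.

```python
def score(pred: set, gold: set):
    """Exact Match requires set equality; F1 is the harmonic mean of set overlap."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"exact_match": pred == gold, "f1": f1}

gold = {("metformin", "500 mg", "oral"), ("heparin", "5000 units", "subcutaneous")}
pred = {("metformin", "500 mg", "oral"), ("aspirin", "81 mg", "oral")}
result = score(pred, gold)   # one correct triple, one spurious, one missed
```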

[AI-62] VerbalValue: A Socially Intelligent Virtual Host for Sales-Driven Live Commerce CVPR2026

链接: https://arxiv.org/abs/2605.14542
作者: Yuyan Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the CVPR 2026 HiGen Workshop

点击查看摘要

Abstract:A skilled live-commerce host is not merely a narrator, but a sales agent who converts viewer curiosity into purchase intent through expert product knowledge, emotionally intelligent response tactics, and entertainment that serves as a vehicle for product exposure. Yet no existing AI system replicates this: conversational recommenders treat recommendation as a terminal act, while general-purpose LLMs hallucinate product claims and default to generic promotional templates that fail to engage or persuade. We present VerbalValue, a sales-conversion-oriented virtual host that turns exceptional verbal ability into real commercial value, built on three contributions. First, we construct a domain knowledge base of product specifications and a curated sales terminology lexicon that anchor product-related responses in verified expertise. Second, we collect and annotate 1,475 live-commerce interactions spanning diverse viewer intents. Third, we fine-tune a large language model on this data to deliver empathetic, commercially oriented responses, adapting to viewer intent through empathetic amplification, evidence-backed rebuttal, and humor-mediated deflection. Experiments against GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, and other baselines demonstrate gains of 23% on informativeness and 18% on factual correctness, with consistent advantages in tactfulness and viewer engagement.

[AI-63] Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing Bidding and Bargaining ICLR2026

链接: https://arxiv.org/abs/2605.14537
作者: Robert Müller,Clemens Müller
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: MALGAI workshop at ICLR 2026

点击查看摘要

Abstract:We introduce \textsc{Cattle Trade}, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50–60 turns. Unlike prior agent benchmarks that test these abilities in isolation, \textsc{Cattle Trade} evaluates whether agents integrate them across a competitive, multi-agent economic game with conflicting incentives. The benchmark logs every bid, TC offer, counteroffer, and card selection, enabling behavioural analysis beyond final scores or win rates. We evaluate seven cost-efficient language models and three deterministic code agents across 242 games. Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs, and behavioural traces surface recurring LLM failure modes including overbidding, self-bidding, bankrupt TC initiation, and weak opponent-state adaptation. Evaluating agentic competence requires benchmarks that test the joint deployment of multiple capabilities in multi-agent environments with conflicting incentives, uncertainty, and economic dynamics.

[AI-64] When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

链接: https://arxiv.org/abs/2605.14504
作者: Zilin Zhu,Longteng Guo,Yanghong Mei,Bowen Pang,Zongxun Zhang,Xingjian He,Ruyi Ji,Jing Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
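The DAG-based planning idea can be sketched with Python's standard topological sorter; the household subtasks and dependencies below are hypothetical, not from the benchmark.

```python
from graphlib import TopologicalSorter

# Hypothetical household task DAG: each key is a subtask, each value its prerequisites.
plan = {
    "wash dishes": {"clear table"},
    "clear table": set(),
    "wipe table": {"clear table"},
    "set table": {"wipe table", "wash dishes"},
}
# A DAG planner executes subtasks only after all their dependencies complete.
order = list(TopologicalSorter(plan).static_order())
```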

[AI-65] Quantifying Cyber-Vulnerability in Power Electronics Systems via an Impedance-Based Attack Reachable Domain

链接: https://arxiv.org/abs/2605.14502
作者: Hongwei Zhen,Ze Yu,Xin Xiang,Wuhua Li,Mingyang Sun
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Power electronics systems are increasingly exposed to cyber threats due to their integration with digital controllers and communication networks. However, an attacker-oriented metric is still lacking to quantify the extent to which a node can be pushed toward instability within a privilege-constrained action space. This letter proposes an impedance-based Attack Reachable Domain (ARD) framework that maps feasible adversarial actions to critical-eigenvalue migration through impedance reshaping. Based on the ARD, an Attack Penetration Index is defined to quantify node-level cyber-vulnerability by jointly characterizing the penetration of the nominal stability margin and the accessibility of successful destabilizing attacks within a privilege-constrained action space. To make the proposed assessment computable when inverter models are unavailable, a practical gray-box workflow is further established by integrating existing impedance identification and differentiable surrogate tools. Case studies on a 4-bus system and a modified IEEE 39-bus system show that coordinated cross-layer manipulations are markedly more damaging than isolated single-layer attacks, and that the proposed metric reveals vulnerability patterns that cannot be inferred from grid-strength indicators.
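The eigenvalue-migration idea behind the Attack Reachable Domain can be illustrated on a toy 2x2 state matrix (our own example, not the paper's impedance model): a privilege-constrained parameter shift moves the critical eigenvalue toward the right half-plane.

```python
import math

def spectral_abscissa_2x2(a, b, c, d):
    """Largest real part of the eigenvalues of [[a, b], [c, d]]."""
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4 * det
    if disc >= 0:
        return max(tr + math.sqrt(disc), tr - math.sqrt(disc)) / 2
    return tr / 2   # complex pair: real part is tr/2

nominal = spectral_abscissa_2x2(-1.0, 2.0, -2.0, -1.0)   # stable, margin -1
# k models an attacker's feasible (privilege-constrained) gain manipulation.
margins = {k: spectral_abscissa_2x2(-1.0 + k, 2.0, -2.0, -1.0)
           for k in (0.0, 1.5, 2.5)}
reachable_unstable = any(m > 0 for m in margins.values())
```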

[AI-66] Fully Dynamic Rebalancing in Dockless Bike-Sharing Systems via Deep Reinforcement Learning

链接: https://arxiv.org/abs/2605.14501
作者: Edoardo Scarpel,Alberto Pettena,Matteo Cederle,Federico Chiariotti,Marco Fabris,Gian Antonio Susto
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 5 figures, 1 table, accepted at the 23rd IFAC World Congress, Busan, South Korea, Aug. 23-26, 2026. Open invited track 9-131: “Control and Optimization for Smart Cities”

点击查看摘要

Abstract:This paper proposes a fully dynamic Deep Reinforcement Learning (DRL) method for rebalancing dockless bike-sharing systems, overcoming the limitations of periodic, system-wide interventions. We model the service through a graph-based simulator and cast rebalancing as a Markov decision process. A DRL agent routes a single truck in real time, executing localized pick-up, drop-off, and charging actions guided by spatiotemporal criticality scores. Experiments on real-world data show significant reductions in availability failures with a minimal fleet size, while limiting spatial inequality and mobility deserts. Our approach demonstrates the value of learning-based rebalancing for efficient and reliable shared micromobility.
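A criticality score of the kind that guides the truck can be sketched as follows; the functional form and zone data are our own heuristic illustration, not the paper's exact definition.

```python
def criticality(bikes, predicted_demand, distance, alpha=0.1):
    """Higher when the expected bike shortfall is large and the zone is near the truck."""
    shortfall = max(predicted_demand - bikes, 0)
    return shortfall / (1.0 + alpha * distance)

# Hypothetical zones: (available bikes, predicted demand, distance to truck in km).
zones = {"A": (2, 10, 1.0), "B": (8, 5, 0.5), "C": (0, 6, 4.0)}
scores = {z: criticality(b, d, dist) for z, (b, d, dist) in zones.items()}
target = max(scores, key=scores.get)   # the agent routes the truck to this zone
```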

[AI-67] ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization IJCAI2026

链接: https://arxiv.org/abs/2605.14497
作者: Letian Yang(1),Xu Liu(1),Yiqiang Lu(2),Jian Liu(2),Weiqiang Wang(2),Shuai Li(1) ((1) Shanghai Jiao Tong University, Shanghai, China, (2) Ant Group, Shanghai, China)
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 20 pages, 9 figures, 7 tables. Accepted to IJCAI 2026

点击查看摘要

Abstract:Offline-to-online reinforcement learning harnesses the stability of offline pretraining and the flexibility of online fine-tuning. A key challenge lies in the non-stationary distribution shift between offline datasets and the evolving online policy. Common approaches often rely on static mixing ratios or heuristic-based replay strategies, which lack adaptability to different environments and varying training dynamics, resulting in suboptimal tradeoff between stability and asymptotic performance. In this work, we propose Reinforcement Learning with Optimized Adaptive Data-mixing (ROAD), a dynamic plug-and-play framework that automates the data replay process. We identify a fundamental objective misalignment in existing approaches. To tackle this, we formulate the data selection problem as a bi-level optimization process, interpreting the data mixing strategy as a meta-decision governing the policy performance (outer-level) during online fine-tuning, while the conventional Q-learning updates operate at the inner level. To make it tractable, we propose a practical algorithm using a multi-armed bandit mechanism. This is guided by a surrogate objective approximating the bi-level gradient, which simultaneously maintains offline priors and prevents value overestimation. Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.
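The multi-armed bandit mechanism over data-mixing choices can be sketched with a standard UCB rule; the candidate ratios and surrogate rewards below are hypothetical, not the paper's exact algorithm.

```python
import math
import random

def ucb_pick(counts, values, t):
    """UCB1: play each arm once, then maximize mean reward plus exploration bonus."""
    for a in range(len(counts)):
        if counts[a] == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]))

random.seed(1)
ratios = [0.1, 0.5, 0.9]                      # fraction of offline data per batch (arms)
true_reward = {0.1: 0.3, 0.5: 0.8, 0.9: 0.4}  # hypothetical surrogate-objective means
counts, values = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 501):
    a = ucb_pick(counts, values, t)
    r = true_reward[ratios[a]] + random.gauss(0, 0.05)
    counts[a] += 1
    values[a] += r
best = ratios[max(range(3), key=lambda a: values[a] / counts[a])]
```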

[AI-68] Contestable Multi-Agent Debate with Arena-based Argumentative Computation for Multimedia Verification ICMR2026

链接: https://arxiv.org/abs/2605.14495
作者: Truong Thanh Hung Nguyen,Vo Thanh Khang Nguyen,Hoang-Loc Cao,Phuc Ho,Van Pham,Hung Cao
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: ACM ICMR 2026 Grand Challenge on Multimedia Verification

点击查看摘要

Abstract:Multimedia verification requires not only accurate conclusions but also transparent and contestable reasoning. We propose a contestable multi-agent framework that integrates multimodal large language models, external verification tools, and arena-based quantitative bipolar argumentation (A-QBAF) as a submission to the ICMR 2026 Grand Challenge on Multimedia Verification. Our method decomposes each case into claim-centered sections, retrieves targeted evidence, and converts evidence into structured support and attack arguments with provenance and strength scores. These arguments are resolved through small local argument graphs with selective clash resolution and uncertainty-aware escalation. The resulting system generates section-wise verification reports that are transparent, editable, and computationally practical for real-world multimedia verification. Our implementation is public at: this https URL.
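The arena-based argumentative resolution can be illustrated with a toy gradual semantics (our own simplification, not the paper's A-QBAF solver): supports and attacks with provenance-derived strengths shift a claim's score, and a verdict is read off the result.

```python
def resolve(base, supports, attacks):
    """base: prior strength of the claim; supports/attacks: argument strength lists."""
    net = base + sum(supports) - sum(attacks)
    score = max(0.0, min(1.0, net))   # clamp to [0, 1]
    verdict = "supported" if score > 0.5 else "refuted" if score < 0.5 else "undecided"
    return score, verdict

# Hypothetical evidence: two weak supports, one strong attack on the claim.
score, verdict = resolve(base=0.5, supports=[0.3, 0.2], attacks=[0.6])
```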

[AI-69] Learning Scenario Reduction for Two-Stage Robust Optimization with Discrete Uncertainty

链接: https://arxiv.org/abs/2605.14494
作者: Tianjue Lin,Jianan Zhou,Jieyi Bi,Yaoxin Wu,Wen Song,Zhiguang Cao,Jie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Two-Stage Robust Optimization (2RO) with discrete uncertainty is challenging, often rendering exact solutions prohibitive. Scenario reduction alleviates this issue by selecting a small, representative subset of scenarios to enable tractable computation. However, existing methods are largely problem-agnostic, operating solely on the uncertainty set without consulting the feasible region or recourse structure. In this paper, we introduce PRISE, a problem-driven sequential lookahead heuristic that constructs reduced scenario sets by evaluating the marginal impact of each scenario. While PRISE yields high-quality scenario subsets, each selection step requires solving multiple subproblems, making it computationally expensive at scale. To address this, we propose NeurPRISE, a neural surrogate model built on a GNN-Transformer backbone that encodes the per-scenario structure via graph convolution and captures cross-scenario interactions through attention. NeurPRISE is trained via imitation learning with a gain-aware ranking objective, which distills marginal gain information from PRISE into a learned scoring function for scenario ranking and selection. Extensive results on three 2RO problems show that NeurPRISE consistently achieves competitive regret relative to comprehensive methods, maintains strong scalability with varying numbers of scenarios, and delivers 7-200x speedup over PRISE. NeurPRISE also exhibits strong zero-shot generalization, effectively handling instances with larger problem scales (up to 5x), more scenarios (up to 4x), and distribution shifts.
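The sequential lookahead selection can be sketched on a toy min-max problem (our own simplification): greedily add the scenario whose inclusion best restores the full worst-case value.

```python
def reduce_scenarios(costs, k):
    """costs[s][x] = second-stage cost of decision x under scenario s.
    The full 2RO value is min over x of max over all s; we greedily pick
    scenarios whose marginal inclusion pushes the reduced value toward it."""
    n_x = len(costs[0])
    full = min(max(costs[s][x] for s in range(len(costs))) for x in range(n_x))
    chosen = []
    for _ in range(k):
        def value(cand):
            sel = chosen + [cand]
            return min(max(costs[s][x] for s in sel) for x in range(n_x))
        cand = max((s for s in range(len(costs)) if s not in chosen), key=value)
        chosen.append(cand)
    return chosen, full

# Toy instance: 4 scenarios x 3 candidate first-stage decisions.
costs = [
    [3, 9, 4],
    [8, 2, 5],
    [1, 1, 7],
    [2, 2, 3],
]
chosen, full = reduce_scenarios(costs, 2)
```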

[AI-70] Deepchecks: Evaluating Retrieval-Augmented Generation (RAG)

链接: https://arxiv.org/abs/2605.14488
作者: Assaf Gerner,Netta Madvil,Nadav Barak,Alex Zaikman,Jonatan Liberman,Liron Hamra,Rotem Brazilay,Shay Tsadok,Yaron Friedman,Neal Harow,Noam Bresler,Shir Chorev,Philip Tannor,Lior Rokach
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) augmented with Retrieval-Augmented Generation (RAG) techniques are revolutionizing applications across multiple domains, such as healthcare, finance, and customer service. Despite their potential, evaluating RAG systems remains a complex challenge due to the stochastic nature of generated outputs and the intricate interplay between retrieval and generation components. This paper introduces Deepchecks, a comprehensive framework tailored for evaluating RAG applications. Deepchecks addresses the evaluation of RAG applications through a multi-faceted approach combining root cause analysis and production monitoring. By ensuring alignment with application-specific requirements, the Deepchecks framework provides a robust foundation for assessing reliability, relevance, and user satisfaction in RAG systems.
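One representative RAG check, answer grounding in the retrieved context, can be sketched as a token-overlap score (our own toy metric, not Deepchecks' implementation).

```python
def grounding_score(answer: str, context: str,
                    stop=frozenset({"the", "a", "is", "in"})):
    """Fraction of the answer's content tokens that also appear in the context."""
    ans = {t for t in answer.lower().split() if t not in stop}
    ctx = set(context.lower().split())
    return len(ans & ctx) / len(ans) if ans else 0.0

context = "insulin lowers blood glucose by promoting cellular uptake"
grounded = grounding_score("insulin lowers blood glucose", context)
ungrounded = grounding_score("aspirin thins the blood", context)
```

A real faithfulness check would use semantic matching rather than surface tokens, but the interface, answer plus retrieved context in, score out, is the same.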

[AI-71] LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning NEURIPS2026

链接: https://arxiv.org/abs/2605.14483
作者: Xudong Chen,Yixin Liu,Hua Wei,Kaize Ding
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Submitted to NeurIPS 2026

点击查看摘要

Abstract:Large language models (LLMs) have become a strong foundation for multi-agent systems, but their effectiveness depends heavily on orchestration design. Across different tasks, role design, capacity assignment, and dependency construction jointly affect both solution quality and execution efficiency. Existing approaches automate parts of this design process, yet they often optimize these decisions partially or sequentially, and rely on execution-level feedback that provides limited credit assignment for local orchestration decisions. We propose LEMON (\textbf{L}earning \textbf{E}xecutable \textbf{M}ulti-agent \textbf{O}rchestratio\textbf{N} via Counterfactual Reinforcement Learning), an LLM-based orchestrator that generates an executable orchestration specification. The specification integrates task-specific roles, customized duties, capacity levels, and dependency structure into a single deployable system. To train the orchestrator, we augment the orchestration-level GRPO objective with a localized counterfactual signal that edits role, capacity, or dependency fields and applies the resulting reward contrast only to the edited spans. Experiments on six reasoning and coding benchmarks, including MMLU, GSM8K, AQuA, MultiArith, SVAMP, and HumanEval, show that LEMON achieves state-of-the-art performance among the evaluated multi-agent orchestration methods. Our code is available at this https URL.
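The localized counterfactual signal can be sketched as a field-level reward contrast (our own toy; the spec fields and reward function are hypothetical).

```python
def counterfactual_credit(spec, field, alternative, reward_fn):
    """Edit one field of the orchestration spec, re-score, and attribute
    the reward contrast only to that field (positive = original choice helped)."""
    edited = dict(spec, **{field: alternative})
    return {field: reward_fn(spec) - reward_fn(edited)}

# Hypothetical scorer: a high-capacity solver plus a review dependency scores best.
def reward_fn(spec):
    return (0.6 if spec["capacity"] == "high" else 0.3) + (0.2 if spec["review"] else 0.0)

spec = {"capacity": "high", "review": True}
credit = counterfactual_credit(spec, "capacity", "low", reward_fn)
```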

[AI-72] From Table to Cell: Attention for Better Reasoning with TABALIGN

链接: https://arxiv.org/abs/2605.14465
作者: Tung Sum Thomas Kwok,Zeyong Zhang,Xinyu Wang,Chunhe Wang,Xiaofeng Lin,Hanwei Wu,Lei Ding,Guang Cheng,Zhijiang Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multi-step LLM reasoning over structured tables fails because planning and execution share no explicit cell-grounding contract. Existing methods constrain the planner to a left-to-right factorization at odds with table permutation invariance, and score intermediate states by generated content alone, overlooking cell grounding. We conduct a pilot study showing that diffusion language models (DLMs) produce more human-aligned and permutation-stable cell attention on tables than autoregressive models, with a 40.2% median reduction in attention-AUROC variability under row reordering. Motivated by this, we propose TABALIGN, a planned table reasoning framework that operationalizes the contract. TABALIGN pairs a masked DLM planner, whose bidirectional denoising emits plan steps as binary cell masks, with TABATTN, a lightweight verifier trained on 1,600 human-verified attention standards to score each step by its attention overlap with the plan-designated mask. Across eight benchmarks covering table question answering and fact verification, TABALIGN improves average accuracy by 15.76 percentage points over the strongest open-source baseline at comparable 8B-class scale, with a matched-backbone ablation attributing 2.87 percentage points of this gain to the DLM planner over an AR planner on a fixed reasoner. Cleaner DLM plans also accelerate downstream reasoning execution by 44.64%.
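The verifier's scoring rule can be sketched as attention-mass overlap with the plan-designated cell mask (our own formulation, not the released TABATTN weights).

```python
def overlap_score(attention, mask):
    """attention: per-cell weights for one reasoning step;
    mask: binary plan mask (1 = plan-designated cell).
    Returns the fraction of attention mass landing on designated cells."""
    total = sum(attention)
    if total == 0:
        return 0.0
    return sum(a for a, m in zip(attention, mask) if m) / total

attn = [0.05, 0.60, 0.05, 0.30]   # model attention over 4 table cells
mask = [0, 1, 0, 1]               # plan says cells 1 and 3 matter for this step
score = overlap_score(attn, mask)
```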

[AI-73] OmniDrop: Layer-wise Token Pruning for Omni-modal LLM s via Query-Guidance

链接: https://arxiv.org/abs/2605.14458
作者: Yeo Jeong Park,Hyemi Jang,Minseo Choi,Jongsun Lee,Jooyoung Choi,Yongkweon Jeon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Omni-modal large language models have demonstrated remarkable potential in holistic multimodal understanding; however, the token explosion caused by high-resolution audio and video inputs remains a critical bottleneck for real-time applications and long-form reasoning. Existing omni-modal token compression methods typically prune tokens at the input embedding level, relying on audio-video similarity or temporal co-occurrence as proxies for semantic relevance. In practice, such assumptions are often unreliable. To address this limitation, we propose OmniDrop, a training-free, layer-wise token pruning framework that progressively prunes audiovisual tokens within the LLM decoder layers rather than at the input-level, allowing early layers to preserve sufficient omni-modal information fusion before aggressively removing tokens in deeper layers. We further utilize text queries as guidance for modality-agnostic and task-adaptive token pruning. We also introduce a temporal diversity score that encourages balanced token survival to preserve global temporal context. Experimental results across various audiovisual benchmarks demonstrate that OmniDrop outperforms all baselines by up to 3.58 points while reducing prefill latency by up to 40% and memory usage by up to 14.7%.
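The layer-wise schedule, full token retention in early layers followed by progressively aggressive query-guided pruning, can be sketched as follows; the score values and linear schedule are our own simplification.

```python
def prune_schedule(scores, n_layers, start_layer, final_keep):
    """Return, per decoder layer, the indices of audiovisual tokens that survive.
    Early layers keep everything (full modality fusion); deeper layers shrink the
    kept set linearly down to final_keep, ranked by query-guided relevance."""
    n = len(scores)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = []
    for layer in range(n_layers):
        if layer < start_layer:
            k = n
        else:
            frac = (layer - start_layer + 1) / (n_layers - start_layer)
            k = max(final_keep, round(n - frac * (n - final_keep)))
        kept.append(sorted(ranked[:k]))
    return kept

scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]   # hypothetical query relevance per token
kept = prune_schedule(scores, n_layers=4, start_layer=2, final_keep=2)
```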

[AI-74] Stateful Reasoning via Insight Replay

链接: https://arxiv.org/abs/2605.14457
作者: Bin Lei,Caiwen Ding,Jiachen Yang,Ang Li,Xin Eric Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose \textbf{InsightReplay}, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a \mathbf{2}\!\times\!\mathbf{3}\!\times\!\mathbf{4} benchmark grid, covering model scales \{\text{8B}, \text{30B}\}, model families \{\text{Qwen3.5}, \text{DeepSeek-R1-Distill-Qwen}, \text{Gemma-4}\}, and reasoning benchmarks \{\text{AIME}, \text{HMMT}, \text{GPQA Diamond}, \text{LiveCodeBench v5}\}, show that 3-round InsightReplay yields accuracy gains across \textbf{all} 24 settings, with an averaged improvement of \mathbf{+1.65} points over standard CoT, and a largest single-setting gain of \mathbf{+9.2} points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.
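The replay mechanism can be sketched as simple prompt plumbing (our own toy; the insight tag and recap format are illustrative).

```python
def replay(steps, period=3, tag="INSIGHT:"):
    """Every `period` reasoning steps, re-inject all tagged insights
    near the active generation frontier."""
    out, insights = [], []
    for i, step in enumerate(steps, 1):
        out.append(step)
        if step.startswith(tag):
            insights.append(step)
        if i % period == 0 and insights:
            out.append("Recap -> " + " | ".join(insights))
    return out

trace = ["expand terms", "INSIGHT: n must be even", "try n=2",
         "check n=4", "INSIGHT: sum telescopes", "combine"]
augmented = replay(trace, period=3)
```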

[AI-75] Intelligence Impact Quotient (IIQ): A Framework for Measuring Organizational AI Impact

链接: https://arxiv.org/abs/2605.14455
作者: Chandan Rajah,Neha Sengupta,Federico Castanedo,Robin Mills,Amit Bahree,Ramesh Krishnan Muthukrishnan,Larry Murray
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The Intelligence Impact Quotient (IIQ) is a composite metric intended to quantify the depth to which AI systems are integrated into organizational work and their impact. Rather than treating access counts or aggregate token volume as sufficient evidence of impact, IIQ combines a novelty-weighted, time-decayed token stock with usage frequency, a grace-period recency gate, organizational leverage, task complexity, and autonomy. The formulation produces a raw Intelligence Adoption Index (IAI) and a normalized 0-1000 IIQ index for comparison between heterogeneous users and units. We also derive sub-daily update rules and a bounded interpretation layer for estimated efficiency and financial impact. The paper positions IIQ as a deployment-oriented measurement framework: a formal proposal for tracking AI embedding in workflows, not a direct measure of model capability or a substitute for causal productivity evaluation. Synthetic scenarios illustrate how the revised metric distinguishes between frequent low-leverage use, semantically repetitive prompting, and more autonomous, higher-consequence AI-assisted work.
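The composite arithmetic can be sketched as follows; the parameter names, half-life, grace period, and cap below are illustrative assumptions, not the official IIQ specification.

```python
import math

def iiq(events, now, half_life=30.0, grace_days=14, leverage=1.0,
        complexity=1.0, autonomy=1.0, cap=1e6):
    """events: list of (day, tokens, novelty in [0, 1]).
    Novelty-weighted, exponentially decayed token stock, gated by recency,
    scaled by leverage/complexity/autonomy, normalized to a 0-1000 index."""
    if not any(now - day <= grace_days for day, _, _ in events):
        return 0.0                                    # recency gate: stale usage scores zero
    decay = math.log(2) / half_life
    stock = sum(tok * nov * math.exp(-decay * (now - day))
                for day, tok, nov in events)
    raw = stock * leverage * complexity * autonomy    # raw adoption index (IAI-like)
    return 1000.0 * min(raw, cap) / cap               # normalized 0-1000 index

events = [(0, 200_000, 0.9), (25, 400_000, 0.5)]      # (day, tokens, novelty)
score = iiq(events, now=30)
```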

[AI-76] Synthesizing POMDP Policies: Sampling Meets Model-checking via Learning

链接: https://arxiv.org/abs/2605.14440
作者: Debraj Chakraborty,Anirban Majumdar,Prince Mathew,Sayan Mukherjee,Jean-François Raskin
机构: 未知
类目: Artificial Intelligence (cs.AI); Formal Languages and Automata Theory (cs.FL); Logic in Computer Science (cs.LO)
备注: Paper accepted at 38th International Conference on Computer Aided Verification (CAV 2026), Lisbon, Portugal, July 2026

点击查看摘要

Abstract:Partially Observable Markov Decision Processes (POMDPs) are the standard framework for decision-making under uncertainty. While sampling-based methods scale well, they lack formal correctness guarantees, making them unsuitable for safety-critical applications. Conversely, formal synthesis techniques provide correctness-by-construction but often struggle with scalability, as general POMDP synthesis is undecidable. To bridge this gap, we propose a synthesis framework that integrates sampling, automata learning, and model-checking. Inspired by Angluin’s L^* algorithm, our approach utilizes sampling as a membership oracle and model-checking as an equivalence oracle. This enables the synthesis of finite-state controllers with formal guarantees, provided the sampling-induced policy is regular. We establish a relative completeness result for this framework. Experimental results from our prototypical implementation demonstrate that this method successfully solves threshold-safety problems that remain challenging for existing formal synthesis tools. We believe our algorithm serves as a valuable component in a portfolio approach to tackling the inherent difficulty of POMDP synthesis problems.

[AI-77] BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

链接: https://arxiv.org/abs/2605.14438
作者: Juntong Wu,Jialiang Cheng,Qishen Yin,Yue Dai,Yuliang Yan,Fuyu Lv,Ou Dan,Li Yuan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 22 pages, 12 figures

点击查看摘要

Abstract:Mixture-of-Experts (MoE) architectures enhance the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, leading to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer from severe performance drop at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. With a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5x faster decoding and 1.4x higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
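The routing change BEAM describes can be illustrated with a small NumPy sketch. Shapes, the threshold value, and the thresholding rule are our illustrative assumptions; the actual method learns per-token binary masks end-to-end with a straight-through estimator:

```python
import numpy as np

def topk_mask(logits, k):
    """Fixed Top-K routing: exactly k experts per token."""
    mask = np.zeros_like(logits, dtype=bool)
    idx = np.argsort(-logits, axis=-1)[:, :k]
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def binary_mask(logits, threshold):
    """Token-adaptive routing: a threshold on router scores decides which
    experts fire, so easy tokens may activate fewer experts. The real BEAM
    learns the masks with a straight-through estimator; the fixed
    threshold here is a stand-in."""
    return logits > threshold

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))               # 4 tokens, 8 experts
fixed = topk_mask(logits, k=2)
adaptive = binary_mask(logits, threshold=0.8)

# Fewer activated experts means fewer expert FLOPs in the MoE layer.
active_fixed, active_adaptive = int(fixed.sum()), int(adaptive.sum())
```

The fixed scheme always pays for k experts per token, while the adaptive mask spends compute only where router scores warrant it.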

[AI-78] Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic

链接: https://arxiv.org/abs/2605.14423
作者: Leo Muxing Wang,Pengkun Yang,Lili Su
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $\gamma \in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated Hopper-v5 action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.

[AI-79] MemLineage: Lineage-Guided Enforcement for LLM Agent Memory

链接: https://arxiv.org/abs/2605.14421
作者: Ciyan Ouyang,Rui Hou
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 24 pages, 8 figures. Rui Hou is the corresponding author

点击查看摘要

Abstract:We introduce MemLineage, a defense for LLM agent memory that attaches both cryptographic provenance and LLM-mediated derivation lineage to every entry. Recent and concurrent work shows that untrusted content can be written into persistent agent state and re-enter later sessions as an instruction; the remaining systems question is how to preserve useful memory recall while preventing such state from justifying sensitive actions. MemLineage treats this as a chain-of-custody problem rather than a filtering problem. It is a six-module design around an RFC-6962 Merkle log over per-principal Ed25519-signed entries: a weighted derivation DAG records which retrieved entries influenced each new memory, and a max-of-strong-edges propagation rule makes Untrusted-Path Persistence hold for any chain whose attribution edges remain above threshold. The sensitive-action gate then refuses dispatches whose active justification descends from an external ancestor, while still allowing benign recall. We evaluate three defense cells against three memory-poisoning workloads on a deterministic mechanism-isolation harness; MemLineage is the only configuration in that harness that drives all three columns to zero ASR, while sub-millisecond per-operation overhead keeps it well below the noise floor of any LLM call. A Codex-backed AgentDojo bridge further separates strong-model behavior from defense-layer behavior: under an intentionally vulnerable tool-output profile, no-defense and signature-only baselines fail on all six banking pairs, while all MemLineage rows reduce strict AgentDojo ASR to zero. The core deterministic artifacts are byte-equal CI-verified; hosted-model AgentDojo and live-model sweeps are recorded as auditable logs rather than byte-pinned artifacts.
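The max-of-strong-edges propagation rule over the derivation DAG can be sketched as follows. Entry names, edge weights, and the threshold are illustrative, not from the paper:

```python
# Threshold above which an attribution edge counts as "strong" (illustrative).
THRESHOLD = 0.5

def propagate_taint(parents, base_taint):
    """parents maps each derived entry (listed in topological order) to its
    (parent, attribution_weight) edges. Taint flows only along edges with
    weight >= THRESHOLD, taking the max over contributing ancestors --
    a max-of-strong-edges rule."""
    taint = dict(base_taint)
    for entry, edges in parents.items():
        inherited = [taint[p] for p, w in edges if w >= THRESHOLD]
        taint[entry] = max([0.0] + inherited)
    return taint

dag = {
    "summary": [("web_page", 0.9), ("user_note", 0.2)],
    "plan":    [("summary", 0.8)],
}
taint = propagate_taint(dag, {"web_page": 1.0, "user_note": 0.0})

# A sensitive-action gate would refuse a dispatch justified by "plan":
# its lineage descends from the untrusted "web_page" via strong edges.
refuse = taint["plan"] >= THRESHOLD
```

The weakly attributed "user_note" edge does not transmit taint, so benign recall along weak edges remains possible while strong untrusted chains are blocked.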

[AI-80] DVMap: Fine-Grained Pluralistic Value Alignment via High-Consensus Demographic-Value Mapping ACL2026

链接: https://arxiv.org/abs/2605.14420
作者: Pengyun Zhu,Yuqi Ren,Zhen Wang,Lei Yang,Deyi Xiong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to the Main Conference of ACL 2026

点击查看摘要

Abstract:Current Large Language Models (LLMs) typically rely on coarse-grained national labels for pluralistic value alignment. However, such macro-level supervision often obscures intra-country value heterogeneity, yielding a loose alignment. We argue that resolving this limitation requires shifting from national labels to multi-dimensional demographic constraints, which can identify groups with predictable, high-consensus value preference. To this end, we propose DVMap (High-Consensus Demographic-Value Mapping), a framework for fine-grained pluralistic value alignment. In this framework, we first present a demographic archetype extraction strategy to construct a high-quality value alignment corpus of 56,152 samples from the World Values Survey (WVS) by strictly retaining respondents with consistent value preferences under identical demographics. Over this corpus, we introduce a Structured Chain-of-Thought (CoT) mechanism that explicitly guides LLMs to reason about demographic-value correlations. Subsequently, we employ Group Relative Policy Optimization (GRPO) to achieve adaptive anchoring of value distributions. To rigorously evaluate generalization, we further establish a triple-generalization benchmark (spanning cross-demographic, cross-country, and cross-value) comprising 21,553 samples. Experimental results demonstrate that DVMap effectively learns the manifold mapping from demographics to values, exhibiting strong generalization and robustness. On cross-demographic tests, Qwen3-8B-DVMap achieves 48.6% accuracy, surpassing the advanced open-source LLM DeepSeek-v3.2 (45.1%). The source code and dataset are available at this https URL.

[AI-81] The Great Pretender: A Stochasticity Problem in LLM Jailbreak

链接: https://arxiv.org/abs/2605.14418
作者: Jean-Philippe Monteuuis,Cong Chen,Jonathan Petit
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:“Oh-Oh, yes, I’m the great pretender. Pretending that I’m doing well. My need is such, I pretend too much…” summarizes the state in the area of jailbreak creation and evaluation. You find this method to generate adversarial attacks proposed by a reputable institution (e.g., BoN from Anthropic or Crescendo from Microsoft Research). However, this method does not deliver on the promise claimed in the paper despite having top ASR scores against industry-grade LLMs. You successfully generate the jailbreak prompts against your target (open) model. However, the generated jailbreak prompt works against the target model with a 50% consecutive success rate (5 out of 10 attempts) despite having an 80% ASR (on paper) on the latest closed-source model (with a guardrail system)! This observation leads us to think. First, Attack Success Rate (ASR), the primary metric for LLM jailbreak benchmarking, is not a stable quantity. Second, published ASR numbers are therefore systematically inflated and incomparable across papers. Therefore, we wonder “Why a successful jailbreak prompt does not perform consistently well against a target model on which the prompts have been optimized?”. To answer this question, we study the impact of stochasticity not only during attack evaluation but also during attack generation. Our evaluation includes several jailbreak attacks, models (different sizes and providers), and judges. In addition, we propose a new metric and two new frameworks (CAS-eval and CAS-gen). Our evaluation framework, CAS-eval, shows that an attack can have an ASR drop of up to 30 percentage points when a jailbreak prompt needs to succeed on more than one attempt. Thankfully, our attack generation framework (CAS-gen) improves previous jailbreak methods and helps them recover this loss of 30 percentage points!
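The gap between single-attempt ASR and consecutive-attempt success can be made concrete with a small sketch. Requiring k consecutive successes is one reading of the paper's consistency idea; the exact CAS metric definition is ours:

```python
def asr_at_k(outcomes, k):
    """outcomes: per-prompt lists of boolean attempt results.
    A prompt counts as a success only if some run of k consecutive
    attempts all succeed; k=1 recovers the usual any-attempt ASR."""
    def ok(attempts):
        streak = 0
        for a in attempts:
            streak = streak + 1 if a else 0
            if streak >= k:
                return True
        return False
    return sum(ok(a) for a in outcomes) / len(outcomes)

outcomes = [
    [True, False, True, False],   # flaky prompt: succeeds, never twice in a row
    [True, True, True, True],     # stable prompt
]
single_attempt = asr_at_k(outcomes, 1)
consecutive = asr_at_k(outcomes, 3)
```

Here the headline single-attempt ASR is 100%, but demanding three consecutive successes halves it, mirroring the paper's observed drop of up to 30 percentage points.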

[AI-82] A Unified Knowledge Embedded Reinforcement Learning-based Framework for Generalized Capacitated Vehicle Routing Problems

链接: https://arxiv.org/abs/2605.14416
作者: Wen Wang,Xiangchen Wu,Liang Wang,Hao Hu,Xianping Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Capacitated Vehicle Routing Problem (CVRP) is a fundamental NP-hard problem with broad applications in logistics and transportation. Real-world CVRPs often involve diverse objectives and complex constraints, such as time windows or backhaul requirements, motivating the development of a unified solution framework. Recent reinforcement learning (RL) approaches have shown promise in combinatorial optimization, yet they rely on end-to-end learning and lack explicit problem-solving knowledge, limiting solution quality. In this paper, we propose a knowledge-embedded framework inspired by the Route-First Cluster-Second heuristics. It incorporates knowledge at two levels: (1) decomposing CVRPs into the route-first and cluster-second subproblems, and (2) leveraging dynamic programming to solve the second subproblem, whose results guide the RL-based constructive solver to solve the first problem. To mitigate partial observability caused by problem decomposition, we introduce a unified history-enhanced context processing module. Extensive experiments show that this framework achieves superior solution quality compared with state-of-the-art learning-based methods, with a smaller gap to classical heuristics, demonstrating strong generalization across diverse CVRP variants.
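The Route-First Cluster-Second decomposition that inspires the framework can be sketched with the classic Split dynamic program: given a fixed customer ordering (the "route-first" output), a DP partitions it into capacity-feasible depot round trips. Coordinates and demands below are toy data, not the paper's setup:

```python
import math

def split(order, demand, coord, depot, capacity):
    """best[j] = cheapest cost of serving the first j customers of `order`
    with capacity-feasible routes that start and end at the depot."""
    n = len(order)
    best = [math.inf] * (n + 1)
    best[0] = 0.0
    for i in range(n):                     # a new route starts at order[i]
        load, cost = 0, 0.0
        for j in range(i, n):              # route serves order[i..j]
            c = coord[order[j]]
            load += demand[order[j]]
            if load > capacity:
                break
            if j == i:
                cost = 2 * math.dist(depot, c)
            else:                          # extend route: reroute the return leg
                prev = coord[order[j - 1]]
                cost += (math.dist(prev, c) + math.dist(depot, c)
                         - math.dist(depot, prev))
            best[j + 1] = min(best[j + 1], best[i] + cost)
    return best[n]

coord = {0: (0, 1), 1: (1, 0), 2: (2, 0)}
demand = {0: 1, 1: 1, 2: 1}
total = split([0, 1, 2], demand, coord, depot=(0, 0), capacity=2)
```

In the paper's framework this second-stage DP result is what guides the RL-based constructive solver for the first stage.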

[AI-83] MahaVar: OOD Detection via Class-wise Mahalanobis Distance Variance under Neural Collapse

链接: https://arxiv.org/abs/2605.14413
作者: Donghwan Kim,Hyunsoo Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 8 figures

点击查看摘要

Abstract:Out-of-distribution (OOD) detection is a critical component for ensuring the reliability of deep neural networks in safety-critical applications. In this work, we present a key empirical observation: for in-distribution (ID) samples, class-wise Mahalanobis distances exhibit a pronounced sharp minimum structure, where the distance to the nearest class is small while distances to all other classes remain large, resulting in high variance across classes. In contrast, OOD samples tend to exhibit a less pronounced sharp minimum structure, producing comparatively lower variance across classes. We further provide a theoretical analysis grounding this observation in Neural Collapse geometry: under relaxed Neural Collapse assumptions on within-class compactness and inter-class separation, ID samples are shown to structurally exhibit high class-wise distance variance, offering a theoretical basis for its use as an OOD score. Motivated by this observation and its theoretical backing, we propose MahaVar, a simple and effective post-hoc OOD detector that augments the Mahalanobis distance with a class-wise distance variance term. Following the OpenOOD v1.5 benchmark protocol, MahaVar achieves state-of-the-art performance on CIFAR-100 and ImageNet, with consistent improvements in both AUROC and FPR@95 over existing Mahalanobis-based methods across all benchmarks.
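The sharp-minimum observation translates into a very small score function. The 2-D toy Gaussian geometry below is our illustration; the paper's exact score combines this variance term with the Mahalanobis distance itself:

```python
import numpy as np

def mahavar_score(x, means, inv_cov):
    """Variance of class-wise (squared) Mahalanobis distances. ID samples
    show a sharp minimum at one class and large distances elsewhere,
    hence high variance; ambiguous OOD samples are flatter."""
    d = np.array([(x - m) @ inv_cov @ (x - m) for m in means])
    return float(d.var())

means = [np.array([5.0, 0.0]), np.array([-5.0, 0.0]), np.array([0.0, 5.0])]
inv_cov = np.eye(2)                      # shared (identity) covariance

id_sample = np.array([5.1, 0.1])         # near one class mean
ood_sample = np.array([0.0, 0.0])        # roughly equidistant from all classes

id_score = mahavar_score(id_sample, means, inv_cov)
ood_score = mahavar_score(ood_sample, means, inv_cov)
```

The equidistant point yields zero variance while the near-class point yields a large one, matching the paper's separation argument under Neural Collapse geometry.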

[AI-84] Energy-Efficient Quadruped Locomotion with Compliant Feet

链接: https://arxiv.org/abs/2605.14411
作者: Pramod Pal(1),Shishir Kolathaya(2),Ashitava Ghosal(1 and 3) ((1) Indian Institute of Science Bangalore India, (2) Robert Bosch Centre for Cyber Physical Systems Bangalore India, (3) Ahmedabad University Ahmedabad India)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 29 pages, 7 figures, supplemental videos link is mentioned in the paper

点击查看摘要

Abstract:Quadruped robots are often designed with rigid feet to simplify control and maintain stable contact during locomotion. While this approach is straightforward, it limits the ability of the legs to absorb impact forces and reuse stored elastic energy, leading to higher energy expenditure during locomotion. To explore whether compliant feet can provide an advantage, we integrate foot compliance into a reinforcement learning (RL) locomotion controller and study its effect on walking efficiency. In simulation, we train eight policies corresponding to eight different spring stiffness values and then cross-evaluate their performance by measuring mechanical energy consumed per meter traveled. In experiments done on a developed quadruped, the energy consumption for the intermediate stiffness spring is lower by ~ 17% when compared to a very stiff or a very flexible spring incorporated in the feet, with similar trends appearing in the simulation results. These results indicate that selecting an appropriate foot compliance can improve locomotion efficiency without destabilizing the robot during motion.

[AI-85] Metis AI: The Overlooked Middle Zone Between AI-Native and World-Movers

链接: https://arxiv.org/abs/2605.14407
作者: Xiang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The dominant discourse on AI limitations frames the boundary of AI capability as a divide between digital tasks (where AI excels) and physical tasks (where embodiment is required). We argue this framing misses the most consequential boundary: the one within digital tasks. We identify a class of tasks we call Metis AI, named for the Greek concept of metis (practical, contextual knowledge), that are performed entirely on computers yet resist reliable AI automation. These tasks are not computationally intractable; they are institutionally, socially, and normatively entangled in ways that defeat algorithmic approaches. We distinguish constitutive metis (knowledge destroyed by the act of formalization) from operational metis (system-specific familiarity that automation can progressively absorb), and propose five structural characteristics that define the Metis AI zone: consequential irreversibility, relational irreducibility, normative open texture, adversarial co-evolution, and accountability anchoring. We ground each in established theory from across the social sciences, philosophy, and humanitarian practice, argue that these characteristics are properties of the tasks themselves rather than limitations of current models, and show that the appropriate design response is not better automation but centaur architectures in which humans lead and AI supports.

[AI-86] Coding Agent Is Good As World Simulator

链接: https://arxiv.org/abs/2605.14398
作者: Hongyu Wang,Jingquan Wang,Bocheng Zou,Radu Serban,Dan Negrut
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints. As a result, the generated video rollouts are not physically plausible, exhibiting unstable contacts, distorted shapes, or inconsistent motion. In this paper, we present an agentic framework constructing physics-based world models through executable simulation code. The framework coordinates planning, code generation, visual review, and physics analysis agents. The planning agent converts the natural language prompt into a structured scene plan, the code agent implements it as executable simulation code, and the visual review agent provides visual feedback while the physics analysis agent checks physical consistency. The code is iteratively revised based on the feedback until the simulation matches the prompt requirements and physical constraints. Experimental results show that our framework outperforms advanced video-based models in physical accuracy, instruction fidelity and visual quality, and can be applied to various scenarios including driving simulation and embodied robot tasks.

[AI-87] Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

链接: https://arxiv.org/abs/2605.14392
作者: Yucheng Shi,Zhenwen Liang,Kishan Panaganti,Dian Yu,Wenhao Yu,Haitao Mi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Tech report, work in progress

点击查看摘要

Abstract:We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry, in which the model can write an oracle once that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others are intrinsically hard to solve but easy to verify, like planted subset-sum or constraint satisfaction. Both create a durable gap between proposing and solving that the policy cannot close by gaming the verifier, and it is this gap that keeps reward informative as the learner improves. We instantiate this view in EvoEnv, a single-policy generator-solver method that synthesizes Python environments from ten seeds and admits them only after staged validation, semantic self-review, solver-relative difficulty calibration, and novelty checks. The strongest evidence comes from the already-strong regime: on Qwen3-4B-Thinking, fixed public-data RLVR and fixed hand-crafted environment RLVR reduce the average score, while EvoEnv improves it from 72.4 to 74.8, a relative gain of 3.3%. Stable self-improvement, we suggest, depends not on producing more synthetic data, but on models learning to construct worlds whose difficulty stays structurally beyond their own reach.
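The solve-verify asymmetry can be illustrated with a planted subset-sum environment of the kind the abstract mentions. The construction details (instance sizes, planting rule) are our own illustration, not EvoEnv's actual generator:

```python
import random

def make_instance(n, seed):
    """Generate numbers, plant a secret subset, and record its sum.
    The oracle (target + planted indices) is built once at generation
    time; recovering the subset from (nums, target) alone is the hard
    solving direction."""
    rng = random.Random(seed)
    nums = [rng.randrange(1, 10**6) for _ in range(n)]
    planted = [i for i in range(n) if rng.random() < 0.5] or [0]
    target = sum(nums[i] for i in planted)
    return nums, target, planted

def verify(nums, target, answer_indices):
    """Verification is a single sum -- cheap regardless of how hard
    finding the subset was."""
    return sum(nums[i] for i in set(answer_indices)) == target

nums, target, planted = make_instance(n=20, seed=42)
```

Because the generator holds the planted answer, it can score any response exactly, while a solver facing only `(nums, target)` must search: this is the durable proposing-vs-solving gap the paper relies on.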

[AI-88] Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning NEURIPS2026

链接: https://arxiv.org/abs/2605.14386
作者: Taebong Kim,Youngsik Hong,Minsik Kim,Sunyoung Choi,Jaewon Jang,Junghoon Shin,Minseo Kim
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: NeurIPS 2026 submission. 18 pages including appendix

点击查看摘要

Abstract:We present Darwin Family, a framework for training-free evolutionary merging of large language models via gradient-free weight-space recombination. We ask whether frontier-level reasoning performance can be improved without additional training, by reorganizing latent capabilities already encoded in existing checkpoints. Darwin introduces three key ideas: (i) a 14-dimensional adaptive merge genome enabling fine-grained component- and block-level recombination; (ii) MRI-Trust Fusion, which adaptively balances diagnostic layer-importance signals with evolutionary search through a learnable trust parameter; and (iii) an Architecture Mapper that enables cross-architecture breeding between heterogeneous model families. Empirically, the flagship Darwin-27B-Opus achieves 86.9% on GPQA Diamond, ranking #6 among 1,252 evaluated models, and outperforming its fully trained foundation model without any gradient-based training. Across scales from 4B to 35B parameters, Darwin models consistently improve over their parents, support recursive multi-generation evolution, and enable a training-free evolutionary merge that combines Transformer- and Mamba-based components. Together, the Darwin Family demonstrates that diagnostic-guided evolutionary merging is a practical and reproducible alternative to costly post-training pipelines for reasoning-centric language models.

[AI-89] Optimal Pattern Detection Tree for Symbolic Rule-Based Classification

链接: https://arxiv.org/abs/2605.14374
作者: Young-Chae Hong,Yangho Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注: Published in Transactions on Machine Learning Research (TMLR). 26 pages, 4 figures. OpenReview URL: this https URL

点击查看摘要

Abstract:Pattern discovery in data plays a crucial role across diverse domains, including healthcare, risk assessment, and machinery maintenance. In contrast to black-box deep learning models, symbolic rule discovery emerges as a key data mining task, generating human-interpretable rules that offer both transparency and intuitive explainability. This paper introduces the Optimal Pattern Detection Tree (OPDT), a rule-based machine learning model based on novel mixed-integer programming to discover a single optimal pattern in data through binary classification. To incorporate prior knowledge and compliance requirements, we further introduce the Branching Structure Constraints (BSC) framework, which enables decision makers to encode domain knowledge and constraints directly into the model. This optimization-based approach discovers a hidden underlying pattern in datasets, when it exists, by identifying an optimal rule that maximizes coverage while minimizing the false positive rate due to misclassification. Our computational experiments show that OPDT discovers a pattern with optimality guarantees on moderately sized datasets within reasonable runtime.

[AI-90] Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization ICML2026

链接: https://arxiv.org/abs/2605.14373
作者: Chen Liang,Xiatao Sun,Qian Wang,Daniel Rakita
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

点击查看摘要

Abstract:Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with "warm starts," effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
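The warm-start idea, keeping a persistent gradient buffer and refreshing only one coordinate block per step via finite differences, can be sketched on a toy quadratic. The objective, block size, and step sizes are our choices; this is the spirit of the method, not the paper's implementation:

```python
import numpy as np

def f(x):
    """Toy smooth objective with minimum at the all-ones vector."""
    return float(np.sum((x - 1.0) ** 2))

def cocd(x, steps, block=2, mu=1e-3, lr=0.1):
    """Zeroth-order block coordinate descent with a persistent ("stale")
    gradient buffer: each step refreshes one cyclic block of coordinates
    with central finite differences and descends along the full buffer."""
    n = len(x)
    g = np.zeros(n)                       # persistent gradient estimate
    for t in range(steps):
        start = (t * block) % n           # cyclic block schedule
        for i in range(start, start + block):
            e = np.zeros(n)
            e[i % n] = mu
            g[i % n] = (f(x + e) - f(x - e)) / (2 * mu)
        x = x - lr * g                    # fresh + stale coords together
    return x

x = cocd(np.full(4, 5.0), steps=200)
```

Only two function-evaluation pairs are spent per step regardless of dimension, yet the stale entries of the buffer keep the update a full-dimensional descent direction on this convex problem.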

[AI-91] LoMETab: Beyond Rank-1 Ensembles for Tabular Deep Learning

链接: https://arxiv.org/abs/2605.14365
作者: Changryeol Choi,Hyewon Park,Yujin Kwon,Gowun Jeong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent tabular learning benchmarks increasingly show a tight performance cluster rather than a clear hierarchy among leading methods, spanning gradient boosted decision trees, attention-based architectures, and implicit ensembles such as TabM. As benchmark gains plateau, a complementary goal is to understand and control the mechanisms that make simple neural tabular models competitive. We propose LoMETab, a rank-$r$ generalization of multiplicative implicit ensembles. LoMETab lifts the rank-1 BatchEnsemble/TabM modulation to a rank-$r$ identity-residual Hadamard family by parameterizing each member weight as $W_k = W \odot (1 + A_k B_k^\top)$, where $W$ is shared and $(A_k, B_k)$ are member-specific low-rank factors. This exposes two practical diversity-control axes: the adapter rank $r$ and the initialization scale $\sigma_{\mathrm{init}}$, and we prove that for $r \ge 2$ this generalization strictly enlarges BatchEnsemble's hypothesis class. Empirically, we show that this added capacity manifests as measurable predictive diversity after training: on representative classification datasets, LoMETab sustains higher pairwise KL than an additive low-rank ablation, and $(r, \sigma_{\mathrm{init}})$ provides broad control over pairwise KL, varying by up to several orders of magnitude across configurations. The induced diversity is reflected in task-appropriate output-level measures: argmax disagreement for classification and ambiguity for regression, indicating that the control extends beyond pairwise KL to decision- and output-level member variation. Finally, experiments sweeping over adapter rank $r$ and initialization scale $\sigma_{\mathrm{init}}$ reveal that predictive performance is dataset-dependent over the $(r, \sigma_{\mathrm{init}})$ grid, supporting LoMETab as a controllable family of implicit ensembles rather than a fixed rank-1 construction.
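The member parameterization $W_k = W \odot (1 + A_k B_k^\top)$ is simple to write down directly. Layer shapes, the number of members, and the initialization scale below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, K = 6, 4, 2, 3           # layer shape, adapter rank, members
sigma_init = 0.1                          # member-factor initialization scale

W = rng.normal(size=(d_out, d_in))        # shared base weight
members = []
for _ in range(K):
    A = sigma_init * rng.normal(size=(d_out, r))
    B = sigma_init * rng.normal(size=(d_in, r))
    # Identity-residual Hadamard modulation: W_k = W * (1 + A B^T).
    members.append(W * (1.0 + A @ B.T))

# Setting A or B to zero recovers the shared weight exactly; r = 1
# recovers the BatchEnsemble/TabM rank-1 modulation as a special case.
deviation = max(float(np.abs(Wk - W).max()) for Wk in members)
```

Per member this adds only $r(d_{out} + d_{in})$ parameters on top of the shared matrix, which is what makes the ensemble "implicit" rather than K independent networks.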

[AI-92] Correctness-Aware Repository Filtering Under Maximum Effective Context Window Constraints

链接: https://arxiv.org/abs/2605.14362
作者: Shweta Mishra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Context window efficiency is a practical constraint in large language model (LLM)-based developer tools. Paulsen [12] shows that all tested models degrade in accuracy well before their advertised context limits, at the Maximum Effective Context Window (MECW), which makes context construction a quality problem, not just a cost one. Modern software repositories routinely contain large non-code artifacts: compiled datasets, binary model weights, minified JavaScript bundles, and gigabyte-scale log files that overflow the context window and push out task-relevant source code. We present a correctness-aware context hygiene framework: a pre-execution, size-based heuristic filter that intercepts repository scans before tokenization, using only OS-level stat() metadata with sub-millisecond overhead. Semantic retrieval approaches such as RepoCoder, GraphRAG, and AST-based chunking require index construction and query-time inference before any filtering decision is reached. Our framework, by contrast, requires no indexing and operates at 0.01 ms per file decision. Across 10 real open-source repositories (22,046 files, 5 languages), the proposed SizeFilter at $\theta = 1$ MB achieves 79.6% ($\pm$13.2%) mean token reduction at 0.30 ms overhead; the HybridFilter achieves 89.3% ($\pm$9.0%), the lowest variance of any filter evaluated. A token-density study across 2,688 files confirms a strong linear correlation (Pearson $r = 0.997$, $k = 0.250$ tokens/byte). A limited-scope evaluation (18 tasks, CodeLlama-7B-Instruct) yields 72% file-level accuracy under filtering versus 25% at baseline; hallucination frequency declines from 61% to 17%. All code and data are released for reproducibility.
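The pre-tokenization size filter is essentially one `os.stat` check per file. The helper name and the demo files are ours; the 1 MB default mirrors the paper's $\theta$:

```python
import os
import tempfile
from pathlib import Path

def size_filter(paths, threshold_bytes=1_000_000):
    """Keep a file for context construction only if its on-disk size
    (from OS-level stat metadata, no file read or tokenization) is
    under the threshold."""
    return [p for p in paths if os.stat(p).st_size < threshold_bytes]

with tempfile.TemporaryDirectory() as d:
    small = Path(d, "main.py")
    small.write_bytes(b"print('hi')\n")
    big = Path(d, "weights.bin")            # 2 MB binary artifact
    big.write_bytes(b"\0" * 2_000_000)
    kept = size_filter([small, big])
    names = [p.name for p in kept]
```

Because only `st_size` is consulted, the decision cost is independent of file content, which is what keeps per-file overhead in the sub-millisecond range the paper reports.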

[AI-93] RQ-MoE: Residual Quantization via Mixture of Experts for Efficient Input-Dependent Vector Compression ICML2026

链接: https://arxiv.org/abs/2605.14359
作者: Zhengjia Zhong,Shuyan Ke,Zaizhou Lin,Jiaqi Song,Hongyi Lan,Hui Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: To appear at ICML 2026

点击查看摘要

Abstract:Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at this https URL.
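For context, the static-codebook residual quantization that RQ-MoE generalizes can be sketched in a few lines. Codebooks here are random toy data; a real system learns them (e.g. by per-stage k-means), and RQ-MoE makes them input-dependent via the MoE:

```python
import numpy as np

def rq_encode(x, codebooks):
    """Greedy multi-stage quantization: each stage picks the nearest
    codeword to the current residual and quantizes what remains."""
    codes, residual = [], x.copy()
    for C in codebooks:                         # one codebook per stage
        i = int(np.argmin(np.linalg.norm(C - residual, axis=1)))
        codes.append(i)
        residual = residual - C[i]
    return codes, residual

def rq_decode(codes, codebooks):
    """Reconstruction is the sum of the selected codewords, one per
    stage -- independent lookups, hence easy to parallelize."""
    return sum(C[i] for C, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]
x = rng.normal(size=8)
codes, residual = rq_encode(x, codebooks)
```

In the plain scheme above the decode of each stage is independent; dynamic quantizers like QINCo condition each stage's codebook on the previous reconstruction, which is the sequential bottleneck RQ-MoE's decoupled design avoids.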

[AI-94] Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

链接: https://arxiv.org/abs/2605.14358
作者: Sanjoy Chowdhury,Dinesh Manocha
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Language models often generate long chain-of-thought traces, but it remains unclear how much of this reasoning is necessary for preserving the final prediction. We study this through the lens of overcomplete reasoning traces: generated traces that contain more intermediate steps than are needed to support the model’s answer. We define the minimal core as the smallest subset of steps that preserves either the final answer or predictive distribution, and introduce metrics for compression ratio, redundancy mass, step necessity, and necessity concentration. Across six deliberative reasoning benchmarks spanning arithmetic, competition mathematics, expert scientific reasoning, and commonsense multi-hop QA, we find substantial overcompleteness: on average, 46% of steps are removable under greedy minimal-core extraction while preserving the original answer in 86% of cases. We also find that predictive support is concentrated: the top three steps account for 65% of measured necessity mass on average. Beyond compression, minimal cores expose a cleaner geometry of reasoning: compared with full traces, they improve correct-incorrect trace separation by 11 points, reduce estimated intrinsic dimensionality by 34%, and transfer across model families with 85% off-diagonal answer retention. Theoretically, we establish existence of minimal sufficient subsets, local irreducibility guarantees for greedy elimination, and certificates of overcompleteness and sparse necessity. Together, these results suggest that full reasoning traces are often verbose and overcomplete, while minimal cores isolate the effective support underlying language-model predictions.
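The greedy minimal-core extraction can be sketched with a mock predictor in place of the language model (the trace format and the toy "model" are our stand-ins):

```python
def predict(steps):
    """Toy stand-in for the model: the answer is the sum of values in
    the surviving 'use' steps; 'filler' steps do not affect it."""
    return sum(v for tag, v in steps if tag == "use")

def minimal_core(steps):
    """Greedy single-pass elimination: drop any step whose removal
    leaves the answer unchanged."""
    answer = predict(steps)
    core = list(steps)
    for s in list(core):
        trial = [t for t in core if t is not s]
        if predict(trial) == answer:        # answer preserved -> removable
            core = trial
    return core

trace = [("use", 2), ("filler", 9), ("use", 3), ("filler", 1)]
core = minimal_core(trace)
```

Here half the trace is removable while the answer is preserved, mirroring the paper's finding that roughly 46% of steps can be dropped on average; the surviving steps are a locally irreducible core in the sense that removing any one of them changes the prediction.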

[AI-95] CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

链接: https://arxiv.org/abs/2605.14344
作者: Yuyang Wu,Stefano Falletta,Delia McGrath,Sherry Yang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Generative modeling has emerged as a promising approach for crystal structure discovery. However, existing LLM-based generative models struggle with low-level atomic precision, while diffusion-based methods fall short in integrating high-level scientific knowledge. As a result, generated structures are often invalid, unstable, or do not possess desirable properties. To address this gap, we propose CrystalReasoner, an end-to-end LLM framework that generates crystal structures from natural language instructions through reasoning and alignment. CrystalReasoner introduces physical priors as thinking tokens, which include crystallographic symmetry, local coordination environments and predicted physical properties before generating atomic coordinates. This bridges the gap between natural language and 3D structures. CrystalReasoner then employs reinforcement learning (RL) with a multi-objective, dense reward function to align generation with physical validity, chemical consistency, and thermodynamic stability. For property-conditioned tasks, we design task-specific reward functions and train specialized models for discrete constraints (e.g., space group) and continuous properties (e.g., elasticity, thermal expansion). Empirical results demonstrate that compared to prior works and baselines without thinking traces or RL, CrystalReasoner obtains better performance on diverse metrics, triples the S.U.N. ratio, and achieves better performance for property-conditioned generation. CrystalReasoner also exhibits adaptive reasoning, increasing reasoning lengths as the number of atoms increases. Our work demonstrates the potential of leveraging thinking traces and RL for generating valid, stable, and property-conditioned crystal structures. Please see our work at this https URL.

[AI-96] AIM-DDI: A Model-Agnostic Multimodal Integration Module for Drug-Drug Interaction Prediction

链接: https://arxiv.org/abs/2605.14327
作者: Yerin Park,Sangseon Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Drug-drug interaction (DDI) prediction is a critical task in computational biomedicine, as adverse interactions between co-administered drugs can cause severe side effects and clinical risks. A key challenge is unseen-drug generalization, where interactions must be predicted for drugs not observed during training. Although multimodal DDI models exploit diverse drug-related information, their fusion mechanisms are often tied to specific prediction architectures, limiting their reuse across models. To address this, we propose AIM-DDI, an architecture-independent multimodal integration module that represents heterogeneous modality information as tokens in a shared latent space. By modeling dependencies across modality tokens through a unified fusion module, AIM-DDI enables model-agnostic integration of structural, chemical, and semantic drug signals across different DDI prediction architectures. Extensive evaluations across diverse DDI models and DrugBank-based settings show that AIM-DDI consistently improves prediction performance, with the strongest gains under the most challenging both-unseen setting where neither drug in a test pair is observed during training. These results suggest that treating multimodal integration as a reusable module, rather than a model-specific fusion component, is an effective strategy for robust unseen-drug DDI prediction.

[AI-97] Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

链接: https://arxiv.org/abs/2605.14322
作者: Zixin Chen,Peng Liu,Rui Sheng,Haobo Li,Jianhong Tu,Xiaodong Deng,Kashun Shum,Dayiheng Liu,Huamin Qu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Under review

点击查看摘要

Abstract:Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.

[AI-98] Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems ICIP

链接: https://arxiv.org/abs/2605.14318
作者: Emilio Mastriani,Alessandro Costa,Federico Incardona,Kevin Munari,Sebastiano Spinello
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, 7 figures. Under review at Neural Computing and Applications. Keywords: semantic segmentation, change point detection, fault anticipation

点击查看摘要

Abstract:Predictive maintenance in complex systems is often complicated by the heterogeneity and redundancy of monitored variables, which can obscure fault-relevant information and reduce model interpretability. This work proposes a semantic feature segmentation framework that decomposes the monitored feature space into a canonical component, expected to retain the dominant predictive information, and a residual component containing structurally peripheral signals. The segmentation is defined through domain-informed criteria and organizes monitoring variables into functional groups reflecting operational mechanisms such as throughput, latency, pressure, network activity, and structural state. To evaluate the effectiveness of this decomposition, we adopt a predictive perspective in which expected predictive risk is used as an operational proxy for task-relevant information. Experimental results obtained through time-aware cross-validation show that the canonical space consistently achieves lower predictive risk than the residual space across multiple temporal configurations, indicating that the semantic segmentation concentrates the most relevant information for fault anticipation. In addition, the canonical segments exhibit significantly stronger intra-segment coherence than inter-segment dependence, and this structural organization remains stable after redundancy reduction. When compared with the full feature space and with a Principal Component Analysis (PCA) representation, the canonical space achieves comparable predictive performance and furthermore preserves the semantic meaning of the original variables. These findings suggest that semantic feature segmentation provides an interpretable and information-preserving decomposition of monitoring signals, enabling competitive predictive performance without sacrificing the operational interpretability required in predictive maintenance applications.

[AI-99] Matrix-Space Reinforcement Learning for Reusing Local Transition Geometry

链接: https://arxiv.org/abs/2605.14304
作者: Zuyuan Zhang,Carlee Joe-Wong,Tian Lan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Compositional generalization in sequential decision-making requires identifying which parts of prior rollouts remain useful for new tasks. Existing methods reuse skills or predictive models, but often overlook rich local transition geometry and dynamics. We propose Matrix-Space Reinforcement Learning (MSRL), a geometric abstraction that represents trajectory segments through positive semidefinite matrix descriptors aggregating first- and second-order statistics of lifted one-step transitions. These descriptors expose shared hidden structure, support algebraic composition in an abstract matrix space, and reveal opportunities for transfer. We prove that the descriptor is well defined up to coordinate gauge, complete for the induced low-order additive signal class, additive under valid segment composition, and minimally sufficient among admissible additive descriptors. We further show that conditioning value functions on the trajectory-segment matrix yields a first-order smooth approximation of action values, enabling source-learned matrix-to-value mappings to bootstrap learning in new tasks. MSRL is plug-in compatible with standard model-free and model-based methods, while obstruction filtering rejects implausible compositions. Empirically, MSRL achieves the best average finite-budget target AUC of 0.73, outperforming MSRL from scratch (0.65), TD-MPC-PT+FT (0.63), and TD-MPC (0.57).

[AI-100] Policy Optimization in Hybrid Discrete-Continuous Action Spaces via Mixed Gradients

链接: https://arxiv.org/abs/2605.14297
作者: Matias Alvo,Daniel Russo,Yash Kanoria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:We study reinforcement learning in hybrid discrete-continuous action spaces, such as settings where the discrete component selects a regime (or index) and the continuous component optimizes within it – a structure common in robotics, control, and operations problems. Standard model-free policy gradient methods rely on score-function (SF) estimators and suffer from severe credit-assignment issues in high-dimensional settings, leading to poor gradient quality. On the other hand, differentiable simulation largely sidesteps these issues by backpropagating through a simulator, but the presence of discrete actions or non-smooth dynamics yields biased or uninformative gradients. To address this, we propose Hybrid Policy Optimization (HPO), which backpropagates through the simulator wherever smoothness permits, using a mixed gradient estimator that combines pathwise and SF gradients while maintaining unbiasedness. We also show how problems with action discontinuities can be reformulated in hybrid form, further broadening its applicability. Empirically, HPO substantially outperforms PPO on inventory control and switched linear-quadratic regulator problems, with performance gaps increasing as the continuous action dimension grows. Finally, we characterize the structure of the mixed gradient, showing that its cross term – which captures how continuous actions influence future discrete decisions – becomes negligible near a discrete best response, thereby enabling approximate decentralized updates of the continuous and discrete components and reducing variance near optimality. All resources are available at this http URL.
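As background for the estimator trade-off the abstract describes, here is a one-dimensional toy comparison of the score-function (SF) and pathwise gradients of E[x²] for x ~ N(mu, 1). This illustrates why SF gradients are noisier in general, not the paper's HPO mixed estimator itself:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 1.5, 200_000

eps = rng.normal(size=n)
x = mu + eps                          # reparameterized samples, x ~ N(mu, 1)
f = x ** 2                            # objective; true gradient d/dmu E[f] = 2*mu

# Score-function (REINFORCE) estimator: E[f(x) * d log p(x; mu)/dmu],
# where d log p/dmu = (x - mu) for a unit-variance Gaussian.
grad_sf = np.mean(f * (x - mu))
# Pathwise estimator: differentiate f(mu + eps) directly, giving 2*(mu + eps).
grad_pw = np.mean(2 * x)

var_sf = np.var(f * (x - mu))         # per-sample variance of the SF estimator
var_pw = np.var(2 * x)                # far smaller for the pathwise estimator
```

Both Monte Carlo averages target 2·mu = 3.0, but the SF per-sample variance is an order of magnitude larger here; HPO's idea is to use the pathwise form wherever the dynamics are smooth and fall back to SF terms (e.g. for the discrete component) while keeping the combination unbiased.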

[AI-101] Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

链接: https://arxiv.org/abs/2605.14294
作者: Hengjie Liu,Zhenya Zhang,Jianjun Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 6 figures, the full version of the paper accepted by CAV 2026

点击查看摘要

Abstract:Formal verification of transformers has become increasingly important due to their widespread deployment in safety-critical applications. Compared to classic neural networks, the inferences of transformers involve highly complex computations, such as dot products in self-attention layers, rendering their verification extremely difficult. Existing approaches explored over-approximation methods by constructing convex constraints to bound the output ranges of transformers, which can achieve high efficiency. However, they may sacrifice verification precision, and consequently introduce significant approximation error that leads to frequent occurrences of false alarms. In this paper, we propose a transformer verification approach that can achieve improved precision. At the core of our approach is a novel usage of ReLU, by which we represent a precise but non-linear bound for dot products such that we can further exploit the rich body of literature for convex relaxation of ReLU to derive precise bounds. We extend two classic approaches to the context of transformers, a rule-based one and an optimization-based one, resulting in two new frameworks for efficient and precise verification. We evaluate our approaches on different model architectures and robustness properties derived from two datasets about sentiment analysis, and compare with the state-of-the-art baseline approach. Compared to the baseline, our approach can achieve significant precision improvement for most of the verification tasks with acceptable compromise of efficiency, which demonstrates the effectiveness of our approach.

[AI-102] Watermarking Game-Playing Agents in Perfect-Information Extensive-Form Games

链接: https://arxiv.org/abs/2605.14283
作者: Juho Kim,Fei Fang,Tuomas Sandholm
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:

点击查看摘要

Abstract:Watermarking techniques for large language models (LLMs), which encode hidden information in the output so its source can be verified, have gained significant attention in recent days, thanks to their potential capability to detect accidental or deliberate misuse. Similar challenges involving model misuse also exist in the context of game-playing, such as when detecting the unauthorized use of AI tools in gaming platforms (e.g., cheating in online chess). In this paper, we initiate the study of how game-playing strategies can be watermarked. We show how the KGW watermark for LLMs can be adapted to watermark game-playing agents in perfect-information extensive-form games. The watermark can then be detected using a statistical test. We show that the degradation in the quality of the watermarked strategy profile, quantified by the expected utility, can be bounded, but there is a tradeoff between detectability and quality. In our experiments, we bootstrap the watermarking framework to various chess engines and demonstrate that a) the impact of the watermark on the quality of the strategy is negligible and b) the watermark can be detected with just a handful of games.
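The KGW scheme the paper adapts biases generation toward a keyed pseudo-random "green list" and detects the watermark with a z-test on green counts. A schematic transplant of that idea to game moves, where the action sets, hashing details, and the always-prefer-green policy are illustrative rather than the paper's construction:

```python
import hashlib
import math

def green_set(context, actions, fraction=0.5):
    """Keyed pseudo-random 'green list' over the legal actions (KGW-style).

    An action is green if a hash of (context, action) lands below `fraction`.
    """
    greens = set()
    for a in actions:
        h = hashlib.sha256(f"{context}|{a}".encode()).digest()
        if int.from_bytes(h[:4], "big") / 2**32 < fraction:
            greens.add(a)
    return greens

def detect_z(moves, contexts, legal, fraction=0.5):
    """One-sided z-statistic: observed green moves vs. the chance level."""
    g = sum(m in green_set(c, acts, fraction)
            for m, c, acts in zip(moves, contexts, legal))
    n = len(moves)
    return (g - fraction * n) / math.sqrt(n * fraction * (1 - fraction))

contexts = [f"pos{i}" for i in range(200)]
legal = [["a", "b", "c", "d"]] * 200

# Watermarked agent: play some green action whenever one exists.
wm_moves = [next(iter(sorted(green_set(c, acts))), acts[0])
            for c, acts in zip(contexts, legal)]
# Unwatermarked agent: always play the first legal action.
base_moves = [acts[0] for acts in legal]

z_wm = detect_z(wm_moves, contexts, legal)
z_base = detect_z(base_moves, contexts, legal)
```

This toy agent ignores move quality entirely; the paper instead perturbs the strategy toward green actions subject to a bound on expected-utility loss, which is exactly the detectability-quality tradeoff the abstract quantifies.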

[AI-103] Parallelizing Counterfactual Regret Minimization

链接: https://arxiv.org/abs/2605.14277
作者: Juho Kim,Tuomas Sandholm
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注: This paper contains and extends ideas that were originally in arXiv:2408.14778

点击查看摘要

Abstract:Parallelization has played an instrumental role in the field of artificial intelligence (AI), drastically reducing the time taken to train and evaluate large AI models. In contrast to its impact in the broader field of AI, applying parallelization to computational game solving is relatively unexplored, despite its great potential. In this paper, we parallelize the family of counterfactual regret minimization (CFR) algorithms, which were central to important breakthroughs for solving large imperfect-information games. We present a generalized parallelization framework, reframing CFR as a series of linear algebra operations. Then, existing techniques for parallelizing linear algebra operations can be applied to accelerate CFR. We also describe how our technique can be applied to other tabular members of the CFR family of algorithms, including the state-of-the-art, such as CFR+, discounted CFR, and predictive variants of CFR. Experimentally, we show that our CFR implementation on a GPU is up to four orders of magnitude faster than Google DeepMind OpenSpiel’s CFR implementations on a CPU.
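To see why CFR maps well onto linear algebra, consider the regret-matching step, which batches cleanly over information sets as elementwise matrix operations. The paper's framework covers full CFR iterations, but this fragment conveys the flavor of what parallelizes on a GPU:

```python
import numpy as np

def regret_matching(regrets):
    """Regret matching, vectorized over a batch of information sets.

    regrets: (n_infosets, n_actions) cumulative counterfactual regrets.
    Returns a strategy matrix: positive regrets normalized per row, and the
    uniform strategy wherever no action has positive regret.
    """
    pos = np.maximum(regrets, 0.0)
    total = pos.sum(axis=1, keepdims=True)
    uniform = np.full_like(pos, 1.0 / regrets.shape[1])
    safe = np.where(total > 0, total, 1.0)        # avoid division by zero
    return np.where(total > 0, pos / safe, uniform)

R = np.array([[ 2.0, -1.0,  1.0],     # infoset with positive regret mass
              [-1.0, -2.0, -3.0]])    # infoset with none -> uniform strategy
sigma = regret_matching(R)
```

Every operation above (clamp, row sum, broadcasted divide, select) is a standard dense linear-algebra kernel, which is why expressing CFR this way lets existing parallel BLAS/GPU machinery accelerate it.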

[AI-104] Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

链接: https://arxiv.org/abs/2605.14266
作者: Vidya K Sudarshan,Anushka Sisodia,Reshma A Ramachandra,Sia Batra,Josephine Chong Leng Leng
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 50 pages, 14 figures, 3 tables

点击查看摘要

Abstract:Integration of artificial intelligence (AI) agents in higher education is transforming teaching, learning and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of an integrated ecosystem-level agentic multi-agent AI platform capable of coordinated planning, reasoning, and adaptive decision-making across multiple educational functions. This paper presents a forward-looking perspective on agentic multi-agent AI platforms in higher education, consisting of interconnected, autonomous, goal-driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can they collectively support seamless coordinated operations across teaching, learning and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task-specific fragmented AI tools, the transition from single-agent to multi-agent systems, limited cross-functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of a holistic, learner-centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for a scalable, human-aligned, and inclusive agentic AI platform. A significant contribution is the incorporation of inclusive learning perspectives, highlighting how a coordinated agentic multi-agent platform can support diverse learners through adaptive, multimodal interventions.

[AI-105] Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques

链接: https://arxiv.org/abs/2605.14261
作者: Juho Kim,Tuomas Sandholm
机构: 未知
类目: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:How should an agent’s performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents’ expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled. In our first contribution, we parameterize the heuristic value function to highlight AIVAT’s potential vulnerabilities: a) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b) one can p-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates. It is then possible to further reduce the variance using inverse-variance weighted averaging, but AIVAT’s unbiasedness guarantee may have to be sacrificed. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43.0% reduction in the number of samples (poker hands) needed to draw statistical conclusions.
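The inverse-variance weighted averaging mentioned near the end can be sketched directly. As the abstract warns, when the variances are heuristic estimates rather than known constants, the weights can correlate with the estimates themselves, which is how unbiasedness may be lost:

```python
import numpy as np

def ivw_combine(estimates, variances):
    """Inverse-variance weighted average of independent estimates.

    With *known* variances this is the minimum-variance unbiased linear
    combination, and the combined variance is 1 / sum(1/v_i). With estimated
    (heuristic) variances the weights may correlate with the estimates, so
    unbiasedness is no longer guaranteed.
    """
    v = np.asarray(variances, dtype=float)
    w = (1.0 / v) / (1.0 / v).sum()               # normalized inverse-variance weights
    est = float(np.dot(w, estimates))
    var = float(1.0 / (1.0 / v).sum())            # combined variance
    return est, var

est, var = ivw_combine([10.0, 14.0], [1.0, 4.0])  # -> (10.8, 0.8)
```

Note that the combined variance (0.8) is below the variance of even the better individual estimator (1.0), which is the mechanism behind the sample-efficiency gain the abstract reports.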

[AI-106] Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

链接: https://arxiv.org/abs/2605.14258
作者: Jesseba Fernando,Grigori Guitchounts
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models are remarkably capable, yet how computation propagates through their layers remains poorly understood. A growing line of work treats depth as discrete time and the residual stream as a dynamical system, where each layer’s nonlinear update has a local linear description. However, previous analyses have relied on scalar summaries or approximate linearizations, leaving the full spectral geometry of trained LLMs unknown. We perform full Jacobian eigendecomposition across three production–scale LLMs and show that training installs a monotonic spectral gradient through depth – from non-normal, rotation-dominated early layers to near–symmetric late layers – together with a cumulative low-rank bottleneck that funnels perturbations into a small fraction of the residual stream’s effective dimensions. Our experiments reveal that this gradient and the dimensional collapse are learned rather than architectural, and is largely dissolved when structured non-normality is removed. We further show that the topological positioning of graph communities predicts whether the Jacobian amplifies or suppresses them, with the sign of the coupling determined by the local operator type, a relationship absent at initialization. These results map a learned spectral geometry in LLMs that links perturbation propagation and compression to the network’s functional topology.
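One common scalar probe of the non-normality the abstract measures is the Frobenius norm of the commutator [J, Jᵀ], which vanishes exactly for normal matrices, including both symmetric matrices and pure rotations. The paper's full analysis uses complete Jacobian eigendecompositions, so this is only a sanity-check sketch of the underlying notion:

```python
import numpy as np

def nonnormality(J):
    """Frobenius norm of the commutator [J, J^T]; zero iff J is normal."""
    return float(np.linalg.norm(J @ J.T - J.T @ J))

sym = np.array([[2.0, 1.0],
                [1.0, 3.0]])           # symmetric -> normal
rot = np.array([[0.0, -1.0],
                [1.0, 0.0]])           # pure rotation -> also normal
shear = np.array([[1.0, 5.0],
                  [0.0, 1.0]])         # a shear: structurally non-normal
```

Since a pure rotation is itself normal, the "non-normal, rotation-dominated" early layers the abstract describes must mix rotation with shear-like structure, which is precisely what this commutator detects.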

[AI-107] Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks

链接: https://arxiv.org/abs/2605.14252
作者: Kai Sun,Peibo Duan,Yongsheng Huang,Guowei Zhang,Benjamin Smith,Nanxu Gong,Levin Kuhlmann
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter-temporal self-distillation, implicitly assuming that per-timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl-KD), which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is available at this https URL

[AI-108] Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

链接: https://arxiv.org/abs/2605.14246
作者: Yushen Liu,Yin-Jen Chen,Ziyi Chen,Tao Wang,Heng Huang,Xugui Zhou,Yanfu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:

点击查看摘要

Abstract:Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.

[AI-109] Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis and Variability Assessment

链接: https://arxiv.org/abs/2605.14242
作者: Xiaohua Wang,Kai Yu,XuXiao Liang,Liang Wang,Chao Han
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer’s criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.

[AI-110] Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

链接: https://arxiv.org/abs/2605.14237
作者: Xiaohua Wang,Kai Yu,XuXiao Liang,Liang Wang,Chao Han
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 8 pages, 5 tables

点击查看摘要

Abstract:Deploying AI agents for repetitive periodic tasks exposes a critical tension: Large Language Models (LLMs) offer unmatched flexibility in tool orchestration, yet their inherent stochasticity causes unpredictable failures, and repeated invocations incur prohibitive token costs. We present the LOOP SKILL ENGINE, a system that achieves a combined 99% success rate and 99% token reduction for periodic agent tasks through a one-shot recording, deterministic replay paradigm. On its first run, the agent executes the task with full LLM reasoning while the system transparently intercepts and records the complete tool-call trajectory. A greedy length-descending template extraction algorithm then converts this recording into a parameterized, branch-free Loop Skill – a deterministic execution plan that captures the task’s functional intent while parameterizing time-dependent and result-dependent variables. All subsequent executions bypass the LLM entirely: the engine resolves template variables against real-time values and replays the tool sequence deterministically. We prove two theorems: (1) Replay Determinism – the step sequence of a validated Loop Skill is invariant across all future executions; (2) Write Safety – concurrent access to persistent configuration is serialized through reentrant locks and atomic file replacement. Across a benchmark of periodic agent tasks spanning intervals from 5 minutes to 24 hours, the Loop Skill Engine reduces monthly token consumption by 93.3%–99.98% and cuts execution latency by 8.7x while eliminating output non-determinism. A multi-layer degradation strategy guarantees that tasks never stall. We release the engine as part of the buddyMe open-source agent framework.
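The record-once / replay-deterministically idea can be sketched as a tiny template pipeline. The tool names, the `dynamic_keys` list, and the `{var}` placeholder syntax below are all illustrative; the paper's greedy length-descending extraction is more elaborate and also parameterizes result-dependent values:

```python
import re

def make_template(recorded_calls, dynamic_keys=("timestamp", "date")):
    """Turn a recorded tool-call trace into a parameterized 'skill'.

    Argument values under `dynamic_keys` become `{var}` placeholders to be
    re-resolved at each replay.
    """
    template = []
    for tool, args in recorded_calls:
        t_args = {k: ("{" + k + "}") if k in dynamic_keys else v
                  for k, v in args.items()}
        template.append((tool, t_args))
    return template

def replay(template, resolver):
    """Deterministically replay the skill, resolving placeholders via `resolver`."""
    calls = []
    for tool, args in template:
        resolved = {k: (resolver(v[1:-1])
                        if isinstance(v, str) and re.fullmatch(r"\{\w+\}", v)
                        else v)
                    for k, v in args.items()}
        calls.append((tool, resolved))
    return calls

# Hypothetical trace recorded during the single LLM-driven run.
trace = [("fetch_report", {"date": "2026-05-14", "limit": 10}),
         ("send_email", {"to": "ops@example.com", "timestamp": "08:00"})]
skill = make_template(trace)
calls = replay(skill, lambda name: {"date": "2026-05-15",
                                    "timestamp": "09:00"}[name])
```

Every replay re-resolves only the time-dependent variables and reuses the recorded tool sequence verbatim, which is why no LLM tokens are consumed after the first run and why the step sequence is invariant (the paper's Replay Determinism property).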

[AI-111] AudioMosaic: Contrastive Masked Audio Representation Learning ICML2026

链接: https://arxiv.org/abs/2605.14231
作者: Hanxun Huang,Qizhou Wang,Xingjun Ma,Cihang Xie,Christopher Leckie,Sarah Erfani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: ICML2026

点击查看摘要

Abstract:Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available in our GitHub repository at this https URL.
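A schematic version of the positive-pair construction: two views of the same spectrogram, each with a random frequency band and time span zeroed out. The abstract does not specify AudioMosaic's patch granularity or mask schedule, so the mask fractions and layout below are arbitrary illustrative choices:

```python
import numpy as np

def tf_mask_pair(spec, rng, t_frac=0.3, f_frac=0.3):
    """Two masked 'views' of a spectrogram via structured time/freq masking.

    spec: (freq_bins, time_frames) magnitude spectrogram. Each view zeroes one
    contiguous frequency band and one contiguous time span at random offsets.
    """
    F, T = spec.shape
    fw, tw = int(F * f_frac), int(T * t_frac)
    views = []
    for _ in range(2):
        m = np.ones_like(spec)
        f0 = rng.integers(0, F - fw + 1)
        t0 = rng.integers(0, T - tw + 1)
        m[f0:f0 + fw, :] = 0.0                # frequency-band mask
        m[:, t0:t0 + tw] = 0.0                # time-span mask
        views.append(spec * m)
    return views

rng = np.random.default_rng(0)
spec = np.abs(rng.normal(size=(64, 100)))     # stand-in for a log-mel spectrogram
v1, v2 = tf_mask_pair(spec, rng)
```

Because each view keeps only part of the input, a contrastive encoder processing these pairs touches fewer unmasked patches per example, which is the memory/large-batch benefit the abstract points to.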

[AI-112] Wavelet-Based Observables for Koopman Analysis: An Extended Dynamic Mode Decomposition Framework

链接: https://arxiv.org/abs/2605.14224
作者: Cankat Tilki,Serkan Gugercin
机构: 未知
类目: Numerical Analysis (math.NA); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS); Functional Analysis (math.FA)
备注:

点击查看摘要

Abstract:We present an in-depth analysis of the Koopman semigroup via wavelet transform. Towards this goal, we start by introducing the wavelet-based observables and show that they are eigenfunctions of the Koopman semigroup when this semigroup is considered over the Banach space of continuous functions on a compact forward-invariant set endowed with the supremum norm. We then construct closed-form expressions of the action of the Koopman semigroup and its resolvent in terms of these observables. To approximate the action of Koopman semigroup numerically, we combine Extended Dynamic Mode Decomposition (EDMD) with the proposed wavelet-based observables leading to the Wavelet Dynamic Mode Decomposition via Continuous Wavelet Transform (cWDMD) algorithm. We validate our theoretical results on two numerical examples.
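The EDMD step that cWDMD builds on fits a Koopman matrix K by least squares so that Ψ(Y) ≈ KΨ(X) over snapshot pairs. With a trivial identity dictionary on a known linear system, K recovers the dynamics matrix exactly; the paper's contribution is the wavelet-based choice of observables Ψ, not this generic fit:

```python
import numpy as np

def edmd(X, Y, psi):
    """Plain EDMD: least-squares K with Psi(Y) ~= K @ Psi(X).

    X, Y: (d, n) snapshot matrices with Y the one-step evolution of X.
    psi:  dictionary of observables applied columnwise (wavelet-based in cWDMD;
          identity here just to sanity-check the fit).
    """
    PX, PY = psi(X), psi(Y)               # columns are lifted snapshots
    return PY @ np.linalg.pinv(PX)        # least-squares Koopman approximation

A = np.array([[0.9, 0.1],
              [0.0, 0.8]])                # known linear dynamics x' = A x
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 50))
Y = A @ X
K = edmd(X, Y, lambda Z: Z)               # identity dictionary recovers A
```

For nonlinear systems the identity dictionary is insufficient, and the quality of the Koopman approximation hinges on the observables, which is exactly where the paper's wavelet-based eigenfunction construction enters.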

[AI-113] Fusion-fission forecasts when AI will shift to undesirable behavior

链接: https://arxiv.org/abs/2605.14218
作者: Neil F. Johnson,Frank Yingjie Huo
机构: 未知
类目: Artificial Intelligence (cs.AI); Physics and Society (physics.soc-ph)
备注:

点击查看摘要

Abstract:The key problem facing ChatGPT-like AI’s use across society is that its behavior can shift, unnoticed, from desirable to undesirable – encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes – and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives – and can forecast – future shifts in the AI’s behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford ‘Delusional Spirals’ corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

[AI-114] GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design ICML

链接: https://arxiv.org/abs/2605.14215
作者: Noah Flynn
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: Link: this https URL

点击查看摘要

Abstract:Genetic circuit design remains a laborious, expert-driven process despite decades of progress in synthetic biology. We study this problem through code generation: models produce Python code in pysbol3 to construct genetic circuits in the Synthetic Biology Open Language (SBOL), a formal representation that supports automated verification. We introduce GenCircuit-RL, a reinforcement learning framework built around hierarchical verification rewards that decompose correctness into five levels, from code execution to task-specific topological checks, and a four-stage curriculum that shifts optimization pressure from code generation to functional reasoning. We also introduce SynBio-Reason, a benchmark of 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts for out-of-distribution evaluation. Hierarchical verification improves task success on functional reasoning tasks by 14 to 16 percentage points over binary rewards, and curriculum learning is required for strong design performance. The resulting models generate topologically correct circuits, generalize to novel biological parts, and rediscover canonical designs from the synthetic biology literature.
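The idea of a hierarchical verification reward can be sketched in a few lines: credit accrues level by level, and a failure at any level forfeits all later levels. This is a hypothetical illustration, not the paper's implementation; the level names and the uniform 0.2 credit per level are assumptions:

```python
def hierarchical_reward(flags):
    """Award credit level by level; a level only counts if every
    earlier level passed (e.g. the code must execute before a
    topology check is meaningful)."""
    levels = ["executes", "valid_sbol", "parts_resolve",
              "wiring_ok", "topology_matches"]   # illustrative names
    reward = 0.0
    for name in levels:
        if not flags.get(name, False):
            break            # fail here: no credit for later levels
        reward += 0.2
    return reward

hierarchical_reward({"executes": True, "valid_sbol": True})  # partial credit
```

Compared with a binary reward, a circuit that runs and produces valid SBOL but has wrong topology still earns partial credit, giving the RL policy a gradient toward functional correctness.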

[AI-115] MetaAgent-X: Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

链接: https://arxiv.org/abs/2605.14212
作者: Yaolun Zhang,Yujie Zhao,Nan Wang,Yiran Wu,Jiayu Chang,Yizhao Chen,Qingyun Wu,Jishen Zhao,Huazheng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, creating a frozen-executor ceiling and leaving the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

[AI-116] ASH: Agents that Self-Hone via Embodied Learning

链接: https://arxiv.org/abs/2605.14211
作者: Benjamin Schneider,Xavier Schneider,Victor Zhong,Sun Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop; when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory – allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of 6.5/12 and 6.0/12 milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

[AI-117] Towards Fine-Grained and Verifiable Concept Bottleneck Models

链接: https://arxiv.org/abs/2605.14210
作者: Yingying Fang,Haijie Xu,Shuang Wu,Mariathasan Anish,Guang Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures

点击查看摘要

Abstract:Concept Bottleneck Models (CBMs) offer interpretable alternatives to black-box predictors by introducing human-relatable concepts before the final output. However, existing CBMs struggle to verify whether predicted concepts correspond to the correct visual evidence, limiting their reliability. We propose a fine-grained CBM framework that grounds each concept in localized visual evidence, enabling direct inspection of where and how concepts are encoded. This design allows users to interpret predictions and verify that the model learns intended concepts rather than spurious correlations. Experiments on medical imaging benchmarks show that our learned concept space is information-complete and achieves predictive performance comparable to standard CBMs, while substantially improving transparency. Unlike post-hoc attribution methods, our framework validates both the presence and correctness of concept representations, bridging interpretability with verifiability. Our approach enhances the trustworthiness of CBMs and establishes a principled mechanism for human-model interaction at the concept level, paving the way toward more reliable and clinically actionable concept-based learning systems.

[AI-118] SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

链接: https://arxiv.org/abs/2605.14205
作者: Zahra Zanjani Foumani,Alberto Castelo,Shuang Xie,Ted Chaiwachirasak,Han Li,Lingyun Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:LLM-based web agents can navigate live storefronts, yet they often collapse to a single “average buyer” policy, failing to capture the heterogeneous and distributional nature of real buyer populations. Existing personalization methods rely on hand-crafted prompt-based personas that are brittle, difficult to scale, context-inefficient, and unable to faithfully represent population-level behavior. We introduce SimPersona, a novel framework that learns discrete buyer types from historical traffic and exposes them to LLM-based web agents as compact persona tokens. Given raw clickstreams, a behavior-aware VQ-VAE induces a discrete buyer-type space that captures the statistical structure of real buyer behavior and merchant-specific buyer population distributions. To provide behavior-specific guidance to LLM-based web agents, SimPersona maps each learned buyer type to a dedicated persona token in the LLM agent vocabulary and fine-tunes the agent with these tokens on real browsing traces. At inference, each synthetic buyer is assigned to a learned buyer type with a single encoder forward pass, requiring no retraining or store-specific prompt engineering. For population-level simulation, SimPersona samples buyer types from each merchant’s empirical distribution over the learned VQ-VAE codebook and instantiates agents with the corresponding persona tokens, preserving merchant-specific buyer population distributions. Evaluated on 8.37M buyers across 42 held-out live storefronts, SimPersona achieves 78% conversion-rate alignment with real buyers, exhibits interpretable behavioral variation across buyer types, and outperforms a baseline with 8× more parameters on goal-oriented shopping tasks. We further release an open-source data pipeline that converts raw e-commerce event logs into buyer representations and agent-training traces.
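The single-forward-pass assignment step of a VQ-VAE reduces to a nearest-neighbor lookup in the learned codebook. A toy sketch, where the 2-D embeddings, the three-entry codebook, and the buyer-type labels are all invented for illustration:

```python
import numpy as np

def assign_buyer_type(z, codebook):
    """Map a behavior embedding z to the index of its nearest
    codebook vector, i.e. its discrete buyer type."""
    dists = np.linalg.norm(codebook - z, axis=1)
    return int(np.argmin(dists))

codebook = np.array([[0.0, 0.0],   # e.g. window shopper
                     [1.0, 1.0],   # e.g. goal-directed buyer
                     [0.0, 2.0]])  # e.g. bulk purchaser
z = np.array([0.9, 1.2])           # encoder output for one clickstream
buyer_type = assign_buyer_type(z, codebook)   # -> 1
```

In the framework described above, this index would then select the persona token prepended to the agent's context, so no prompt engineering is needed per store.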

[AI-119] LLM-Based Robustness Testing of Microservice Applications: An Empirical Study

链接: https://arxiv.org/abs/2605.14202
作者: Hrushitha Goud Tigulla,Marco Vieira
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Malformed, missing, or boundary-value inputs in microservice APIs can cascade across dependent services, threatening reliability. Robustness testing systematically exercises such inputs to expose server-side failures, but generating diverse, effective tests remains challenging. Large Language Models can generate such tests from API specifications; however, it is unknown whether different models and prompt strategies produce diverse failure sets or converge on the same failures. We report a controlled experiment applying 7 prompt strategies to 3 open-source LLMs (14B-70B parameters) targeting 2 architecturally distinct microservice systems: one Java monolingual (6 services, 9 failure modes) and one polyglot (27 services, 14 failure modes), yielding 38 valid runs and 663 generated tests. We find that prompt strategy explains more variation in diversity than model size: a Structured prompt collapses diversity entirely, while a single model varied across three prompt strategies achieves complete failure-mode coverage on one system, outperforming any multi-model ensemble under a fixed prompt. We introduce two strategies, Guided and GuidedFewShot, that embed a mutation taxonomy from prior robustness testing research as domain context. GuidedFewShot achieves the highest single-run coverage on both systems (5 of 9 and 8 of 14 failure modes) while maintaining low cross-model similarity. A key lesson is that taxonomy rules alone are insufficient: LLMs cannot distinguish key-absent from value-empty mutations without concrete examples. Findings replicate across both systems.

[AI-120] Grounded Continuation: A Linear-Time Runtime Verifier for LLM Conversations

链接: https://arxiv.org/abs/2605.14175
作者: Qisong He,Yi Dong,Xiaowei Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In long conversations, an LLM can produce a next utterance that sounds plausible but rests on premises the conversation has already abandoned. Context-manipulation attacks against deployed agents now actively exploit this gap. We close it with a runtime verifier that maintains an explicit dependency graph: an LLM classifies each turn into one of 8 update operations drawn from four formalisms (dynamic epistemic logic, abductive reasoning, awareness logic, argumentation), and a symbolic engine records which claims depend on which evidence. Checking whether a continuation is supported reduces to a graph walk; retraction propagates through the same graph to flag exactly the conclusions that lose support, with linear per-turn cost and a formal conflict-free guarantee. On LongMemEval-KU oracle (n=78), the verifier reaches 89.7% accuracy vs. 88.5% for the LLM-only baseline (+1.3pp) and 87.2% for a transcript-RAG baseline matched on retrieval budget (+2.6pp); wins among disagreements are correct abstentions where the baseline confabulates. On LoCoMo’s 60 official QA items the verifier is competitive with retrieval-augmented baselines. Beyond external benchmarks, we construct two multi-agent scenarios and a 50-item grounding test: on the 15-item stale-premise subset, the verifier reaches 100% accuracy vs. 93.3% (+6.7pp). These instantiate a soundness-faithfulness decomposition: the structural check is sound by construction, and per-deployment LLM extraction faithfulness is the empirical question we measure across four LLM families. The retraction check plateaus at microseconds while history-replay grows linearly with conversation length.
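The graph-walk retraction described above can be sketched directly: claims record what they rest on, and retracting a piece of evidence flags exactly the claims that transitively lose support. This is a minimal illustration with hypothetical claim/evidence names, not the paper's verifier:

```python
from collections import defaultdict

class DependencyGraph:
    """Minimal claim/evidence graph with propagating retraction."""

    def __init__(self):
        self.deps = defaultdict(set)    # claim -> items it rests on
        self.rdeps = defaultdict(set)   # item  -> claims resting on it
        self.retracted = set()

    def add_claim(self, claim, supports):
        for item in supports:
            self.deps[claim].add(item)
            self.rdeps[item].add(claim)

    def retract(self, item):
        """Retract item; return every claim that transitively loses support."""
        stack, lost = [item], set()
        while stack:
            node = stack.pop()
            if node in self.retracted:
                continue
            self.retracted.add(node)
            for claim in self.rdeps[node]:
                lost.add(claim)
                stack.append(claim)     # claims may support further claims
        return lost

    def supported(self, claim):
        """Linear-time check: claim and all its supports still stand."""
        return claim not in self.retracted and all(
            s not in self.retracted for s in self.deps[claim])

g = DependencyGraph()
g.add_claim("C1", ["E1"])   # C1 rests on evidence E1
g.add_claim("C2", ["C1"])   # C2 rests on C1
g.retract("E1")             # -> {"C1", "C2"}: both lose support
```

Checking whether a proposed continuation is grounded then amounts to calling `supported` on each premise it relies on, which is the linear per-turn cost the abstract highlights.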

[AI-121] The Evaluation Trap: Benchmark Design as Theoretical Commitment

链接: https://arxiv.org/abs/2605.14167
作者: Theodore J Kalaitzidis
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 13 pages

点击查看摘要

Abstract:Every AI benchmark operationalizes theoretical assumptions about the capability it claims to assess. When assumptions function as unexamined commitments, benchmarks stabilize the dominant paradigm by narrowing what counts as progress. Over time, narrow evaluation reorganizes capability concepts: architectures and definitions are selected for benchmark legibility until evaluation ceases to track an independent object and instead produces a version of the target defined by its own operational assumptions. The result is a trap: evaluation frameworks treat self-reinforcing assessments as valid, both creating and obscuring structural limits on what the current paradigm can accomplish. We introduce Epistematics, a methodology for deriving evaluation criteria directly from technical capability claims and auditing whether proposed benchmarks can discriminate the claimed capability from proxy behaviors. The contribution is meta-evaluative: an audit procedure, a failure mode taxonomy, and benchmark-design criteria for evaluating capability-evaluation coherence. We demonstrate the procedure through a worked audit of Dupoux et al. (2026), a proposal that revises the dominant paradigm’s theoretical assumptions at the architectural level while reproducing them in its evaluation criteria, thereby entrenching the constraint it seeks to overcome in a form the evaluation cannot detect.

[AI-122] Unsteady Metrics and Benchmarking Cultures of AI Model Builders

链接: https://arxiv.org/abs/2605.14164
作者: Stefan Baack,Christo Buschek,Maty Bohacek
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The primary way to establish and compare competencies in foundation and generative AI models has shifted from peer-reviewed literature to press releases and company blog posts, where model builders highlight results on selected benchmarks. These artifacts now largely define the state of the art for researchers and the public. Despite their prominence, which benchmarks model builders choose to highlight, and what they communicate through this selection, is underexamined. To investigate, we introduce and open-source Benchmarking-Cultures-25, a dataset of 231 benchmarks highlighted across 139 model releases in 2025 from 11 major AI builders, alongside an interactive tool to explore the data. Our analysis reveals a fragmented evaluation landscape with limited cross-model comparability: 63.2% of highlighted benchmarks are used by a single builder, and 38.5% appear in just one release. Few achieve widespread use (e.g., GPQA Diamond, LiveCodeBench, AIME 2025). Moreover, benchmarks are attributed different competencies by different builders, depending on their narrative. To disentangle these conflicting presentations, we develop a unified taxonomy mapping diverging terminology to a shared framework of measured signals based on what benchmark authors claim to measure. “General knowledge application” is the second most popular, yet vaguely defined, category. Qualitative analysis shows many such benchmarks deemphasize construct validity, instead framing results as indicators of progress toward AGI. Their authors claim to measure knowledge or reasoning broadly, yet mostly evaluate STEM subjects (especially math). We argue that highlighted benchmarks function less as standardized measurement tools and more as flexible narrative devices prioritizing market positioning over scientific evaluation. Data: this https URL tool: this https URL.

[AI-123] Agentic Systems as Boosting Weak Reasoning Models

链接: https://arxiv.org/abs/2605.14163
作者: Varun Sunkaraneni,Pierfrancesco Beneventano,Riccardo Neumarker,Tomaso Poggio,Tomer Galanti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that "more agents help": samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-k converges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a single GPT-5.4 nano proposal solves 67.0% of tasks. Using the same nano model, our critic-comparator orchestration reaches 76.4% with k=8 proposals, matching the standalone performance of Gemini 3 Pro and Claude Opus 4.5 Thinking and approaching the 79.0% oracle best-of-8 upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.
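The proposer/verifier split can be illustrated with a toy best-of-k selector: sampling supplies coverage, a local soundness check (standing in for tests or type checking) filters, and a comparator picks among the survivors. All values below are invented for illustration:

```python
def best_of_k(candidates, is_sound, score):
    """Keep locally sound candidates; return the comparator's
    favorite, or None when coverage fails (no sound candidate)."""
    sound = [c for c in candidates if is_sound(c)]
    return max(sound, key=score, default=None)

patches = [3, 7, 4, 9, 2, 8, 5, 1]        # k = 8 sampled "patches"
is_sound = lambda c: c % 2 == 0           # stand-in for a test suite
best = best_of_k(patches, is_sound, lambda c: c)   # -> 8
```

The `None` branch is the proposer-side ceiling from the abstract: if no sampled candidate is useful, no amount of selection can recover a solution.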

[AI-124] ExploitBench: A Capability Ladder Benchmark for LLM Cybersecurity Agents

链接: https://arxiv.org/abs/2605.14153
作者: Seunghyun Lee,David Brumley
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Exploitation is not a binary event. It is a ladder of acquiring progressive capabilities, from executing a single buggy line of code to taking full control of the target. However, existing LLM security benchmarks treat a crash as exploitation success. That single binary outcome collapses the hard parts of exploitation: the transition from triggering a bug to constructing reusable primitives and control. We present ExploitBench, a capability-graded benchmark that decomposes exploitation into 16 measurable flags, from coverage and crash through sandbox primitives, arbitrary read/write, control-flow hijack, and arbitrary code execution. Each capability is verified by a deterministic oracle that uses a per-run randomized challenge-response for primitives, differential execution against ground-truth binaries to measure progress, and a signal-handler proof for code execution. We instantiate ExploitBench on 41 V8 bugs because V8 is both widely deployed and exploitation-hardened. We report three arms: (model, env) as the primary measurement of model-environment capability; (model, env, adaptive coaching) as a secondary arm that adds adaptive coaching to test whether targeted feedback shifts outcomes; and (model, env, harness) as an ablation that swaps in the model’s native CLI to check whether vendor-side optimizations increase exploitation capabilities. Our results show a sharp capability split between publicly deployed frontier models and the private frontier. Across the 8 publicly deployed models tested, reaching the vulnerable code and triggering a crash is routine, but arbitrary code execution is not. The private model shows arbitrary code execution on approximately half. Overall, results suggest that exploit construction against hardened targets is an emerging frontier capability.

[AI-125] Distribution-Aware Algorithm Design with LLM Agents

链接: https://arxiv.org/abs/2605.14141
作者: Saharsh Koganti,Priyadarsi Mishra,Pierfrancesco Beneventano,Tomer Galanti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We study learning when the learned object is executable solver code rather than a predictor. In this setting, correctness is not enough: two solvers may both return valid solutions on the deployment distribution while differing substantially in runtime. Given samples from an unknown task distribution, the learner returns code evaluated on fresh instances by both solution quality and execution time. Our central abstraction is a solver hint: reusable structure inferred from samples and compiled into specialized solver code. We prove that the empirically fastest sample-consistent solver from a fixed library generalizes in both correctness and runtime, and that statistically identifiable hints can be recovered and compiled from polynomially many samples. Empirically, we instantiate the framework with LLM code agents on 21 structured combinatorial-optimization target distributions across seven problem classes. The synthesized solvers reach mean normalized quality 0.971, improve by +0.224 over the average heuristic pool and by +0.098 over the highest-quality heuristic, and are 336.9×, 342.8×, and 16.1× faster than the quality-best heuristic, Gurobi, and the selected time-limited exact backend, respectively. On released PACE 2025 Dominating Set private instances, the synthesized solver is valid on all 100 graphs and runs about two orders of magnitude faster than top competition solvers, with a moderate quality gap. Inspection shows that many gains come from changing the computational scale: replacing ambient exponential search or general-purpose optimization with compiled distribution-specific computation.

[AI-126] ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

链接: https://arxiv.org/abs/2605.14133
作者: Yuxiang Lai,Peng Xia,Haonian Ji,Kaiwen Xiong,Kaide Zeng,Jiaqi Liu,Fang Wu,Jike Zhong,Zeyu Zheng,Cihang Xie,Huaxiu Yao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present \textbfClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

[AI-127] Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)

链接: https://arxiv.org/abs/2605.14126
作者: Marius S. Knorr,Robert Müller,Jan P. Bremer,Nils Schweingruber
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. A LLM Judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

[AI-128] ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

链接: https://arxiv.org/abs/2605.14102
作者: Tarun Mittal
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages, 4 tables, 1 figure. Technical report

点击查看摘要

Abstract:Autonomous language-model agents increasingly combine planning, tool use, document processing, browsing, code execution, and verification loops. These capabilities make agent systems more useful, but they also introduce operational failure modes that are not visible from final accuracy alone. This report presents ChromaFlow, a tool-augmented autonomous reasoning framework built around planner-directed execution, specialized tool use, and telemetry-driven evaluation. We analyze ChromaFlow on GAIA 2023 Level-1 validation tasks under clean evaluation constraints. A frozen full Level-1 baseline achieved 29/53 correct answers, or 54.72%. A later recovery configuration with expanded orchestration achieved 27/53 correct answers, or 50.94%, while increasing tracebacks, timeout events, tool-failure mentions, token-line calls, and campaign-log cost estimates. Two randomized 20-task smoke evaluations produced 12/20 and 11/20 correct answers, showing that small diagnostic gains can be unstable across samples. The central result is therefore a negative ablation: more aggressive orchestration did not improve full-set performance and increased operational noise. The report argues that bounded planner escalation, deterministic extraction, evidence reconciliation, and explicit run gates should be treated as first-order requirements for reliable autonomous agent evaluation.

[AI-129] SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

链接: https://arxiv.org/abs/2605.14089
作者: Mingda Zhang,Tiesunlong Shen,Haoran Luo,Wenjin Liu,Zikai Xiao,Erik Cambria,Xiaoying Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 49 pages, 5 figures, 6 tables

点击查看摘要

Abstract:In recent years, a variety of powerful LLM-based agentic systems have been applied to automate complex tasks through task orchestration. However, existing orchestration methods still face key challenges, including strategy collapse under reward maximization, high gradient variance with opaque credit assignment, and unguided skill evolution whose decisions are typically made by directly prompting an LLM to judge rather than derived from principled training signals. To address these challenges, we propose SkillFlow, a flow-based framework that takes a trainable Supervisor as the agent and a structured environment with dynamic skill library and frozen executor, automating task orchestration through multi-turn interaction. SkillFlow employs Tempered Trajectory Balance (TTB), a regression-based flow-matching loss that samples trajectories proportional to reward, preserving diverse orchestration strategies rather than collapsing to a single mode. The same flow objective yields a jointly learned backward policy that provides transparent per-step credit assignment at zero additional inference cost. Building on these flow diagnostics, a recursive skill evolution mechanism determines when to evolve, what skills to create or prune, and where decision gaps lie – closing the loop from training signal to autonomous capability growth. Experimental results on 14 datasets show that SkillFlow significantly outperforms baselines across question answering, mathematical reasoning, code generation, and real-world interactive decision making tasks. Our code is available at this https URL.

[AI-130] AttnGen: Attention-Guided Saliency Learning for Interpretable Genomic Sequence Classification

链接: https://arxiv.org/abs/2605.14073
作者: Rayhaneh Shabani Nia,Ali Karkehabadi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at IEEE CCGE 2026

点击查看摘要

Abstract:Deep neural networks have achieved strong performance in genomic sequence classification; however, relating their predictions to biologically meaningful sequence patterns remains challenging. In this work, we present AttnGen, an attention-guided training framework that embeds interpretability directly into the optimization process. AttnGen computes nucleotide-level importance scores using an attention mechanism and progressively suppresses low-contribution positions during training. This encourages the model to focus its predictions on a compact set of informative regions while reducing reliance on noisy sequence elements. We evaluate AttnGen on the standardized demo_human_or_worm benchmark, a binary classification task over 200-nucleotide sequences. With moderate masking, AttnGen achieves a validation accuracy of 96.73%, outperforming a conventional CNN baseline with 95.83% accuracy, while also exhibiting faster convergence and improved training stability. To assess whether the learned importance scores reflect functionally relevant signal, we conduct perturbation-based analysis by removing high-saliency nucleotides. This causes accuracy to drop from 96.9% to near chance level on a 3,000-sequence evaluation set, indicating that the model relies on a relatively small subset of informative positions. Our analysis shows that masking 10–20% of positions provides the most favorable trade-off between predictive performance and interpretability. These results suggest that attention-guided masking not only improves classification performance but also reshapes how models distribute importance across sequence positions. Although this study focuses on short genomic sequences, the proposed approach may extend to more complex interpretable sequence modeling settings.
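The suppression step described above amounts to zeroing out the lowest-scoring fraction of positions. A NumPy sketch with made-up importance scores (the real framework computes these from an attention mechanism and applies the mask progressively during training):

```python
import numpy as np

def attention_mask(scores, mask_frac):
    """Suppress the lowest-importance fraction of positions.

    scores : (seq_len,) nucleotide importance scores.
    Returns a 0/1 mask keeping the top (1 - mask_frac) positions.
    """
    n_mask = int(len(scores) * mask_frac)
    order = np.argsort(scores)          # ascending: least important first
    mask = np.ones_like(scores)
    mask[order[:n_mask]] = 0.0
    return mask

scores = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.2, 0.8, 0.3])
mask = attention_mask(scores, mask_frac=0.25)   # drop 2 of 8 positions
# positions with scores 0.05 and 0.1 are suppressed
```

The 10-20% sweet spot reported in the abstract corresponds to choosing `mask_frac` in that range, trading a little signal for a more compact, inspectable set of informative positions.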

[AI-131] MathAtlas: A Benchmark for Autoformalization in the Wild NEURIPS2026

链接: https://arxiv.org/abs/2605.14061
作者: Nilay Patel,Noah Arias,Davit Babayan,Victoria Cochran,Timothy Libman,Hafsah Mahmood,Liam McCarty,Soli Munoz,Laurel Willey,Jeffrey Flanigan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: In submission at NeurIPS 2026

点击查看摘要

Abstract:Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research-level mathematics remains underexplored. In this paper, we introduce MathAtlas, the first large-scale autoformalization benchmark of in-the-wild graduate-level mathematics, containing ~52k theorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks. MathAtlas is enriched with a mathematical dependency graph containing ~178k relations, and is the first autoformalization benchmark to include such relations, facilitating the evaluation and development of dependency-aware autoformalization systems. Our extensive experiments show that MathAtlas is high quality but extremely challenging: strong baselines achieve at most 9.8% correctness on theorem statements and 16.7% on definitions. Furthermore, we find that the performance of state-of-the-art models degrades substantially with dependency depth: on MA-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2.6% autoformalization correctness. We release MathAtlas to the community as a benchmark set for large-scale autoformalization of graduate-level mathematics in the wild.

[AI-132] SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks

链接: https://arxiv.org/abs/2605.14051
作者: Yusuke Ozaki,Dhaval Patel
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 31 pages, 10 figures

点击查看摘要

Abstract:Industrial LLM agent systems often separate planning from execution, yet LLM planners frequently produce structurally invalid or unnecessarily long workflows, leading to brittle failures and avoidable tool and API cost. We propose SPIN, a planning wrapper that combines validated Directed Acyclic Graph (DAG) planning with prefix-based execution control. SPIN enforces a strict DAG contract through _validate_plan_text and repair prompting, producing executable plans before downstream execution, and then evaluates DAG prefixes incrementally to stop when the current prefix is sufficient to answer the query. On AssetOpsBench, across 261 scenarios, SPIN reduces executed tasks from 1061 to 623 and improves the Accomplished score from 0.638 to 0.706, while reducing tool calls from 11.81 to 6.82 per run. On MCP Bench, the same wrapper improves planning, grounding, and dependency-related scores for both GPT OSS1 and Llama 4 Maverick.
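The strict DAG contract can be sketched as a plan validator (the plan format and helper below are illustrative assumptions, not the paper's _validate_plan_text): a plan passes only if every prerequisite is a known task and the dependency graph contains no cycles, checked here with Kahn's algorithm.

```python
# Minimal DAG-contract check: a plan maps each task to its prerequisites;
# the plan is executable only if all references resolve and the graph is acyclic.
from collections import deque

def validate_plan(plan):
    """plan: dict mapping task name -> list of prerequisite task names."""
    if any(d not in plan for deps in plan.values() for d in deps):
        return False  # reference to an undefined task
    indegree = {t: len(deps) for t, deps in plan.items()}
    dependents = {t: [] for t in plan}
    for t, deps in plan.items():
        for d in deps:
            dependents[d].append(t)
    # Kahn's algorithm: repeatedly schedule tasks whose prerequisites are done.
    ready = deque(t for t, deg in indegree.items() if deg == 0)
    done = 0
    while ready:
        t = ready.popleft()
        done += 1
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return done == len(plan)  # every task is orderable iff the graph is acyclic

print(validate_plan({"fetch": [], "parse": ["fetch"], "report": ["parse"]}))  # True
print(validate_plan({"a": ["b"], "b": ["a"]}))                                # False
```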

[AI-133] Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning MICCAI2026

链接: https://arxiv.org/abs/2605.14048
作者: Leo Milecki,Qingyu Hu,Bahram Jafrasteh,Mert R. Sabuncu,Qingyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted version to MICCAI 2026 (Provisional Accept)

点击查看摘要

Abstract:Masked autoencoders (MAEs) have recently shown promise for self-supervised representation learning of resting-state brain functional connectivity (FC). However, a fundamental question remains unresolved: how should FC matrices be tokenized to align with the intrinsic modular organization of large-scale brain networks? Existing approaches typically adopt region-centric or graph-based schemes that treat FC as structurally homogeneous elements and overlook the large-scale network brain organization. We introduce NERVE (Network-Aware Representations of Brain Functional Connectivity via Bilinear Tokenization), a self-supervised learning framework that redefines FC tokenization by partitioning FC matrices into patches of intra- and inter-network connectivity blocks. Unlike image-based MAE, where fixed-size patches share a common tokenizer, FC patches defined by network pairs are heterogeneous in size and correspond to distinct functional roles. To resolve this problem, NERVE embeds FC patches through a novel structured bilinear factorization. This formulation preserves network identity and reduces parameter complexity from quadratic to linear scaling in the number of networks. We evaluate NERVE across three large-scale developmental cohorts (ABCD, PNC, and CCNP) for behavior and psychopathology prediction. Compared to structurally agnostic MAE variants and graph-based self-supervised baselines, the proposed network-aware formulation yields more stable and transferable representations, particularly in cross-cohort evaluation. Ablation studies confirm that the proposed bilinear network embedding and anatomically grounded parcellation are critical for performance. These findings highlight the importance of incorporating domain-specific structural priors into self-supervised learning for functional connectomics.

[AI-134] Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

链接: https://arxiv.org/abs/2605.14038
作者: Yize Cheng,Chenrui Fan,Mahdi JafariRaviz,Keivan Rezaei,Soheil Feizi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model’s empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further discover that the majority of mismatches are concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
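The probe-direction comparison can be sketched with plain cosine similarity (the vectors below are stand-ins for learned probe weights over hidden states, not real model data): a value near zero indicates the near-orthogonality the authors report between the cognition and action directions.

```python
# Cosine similarity between two probe directions; probes themselves (e.g.,
# logistic regressions on hidden states) are omitted for brevity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

cognition_dir = [1.0, 0.0, 0.2]  # stand-in: "tool is needed" probe weights
action_dir = [0.0, 1.0, 0.1]     # stand-in: "tool call is emitted" probe weights
print(round(cosine(cognition_dir, action_dir), 3))  # near 0: nearly orthogonal
```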

[AI-135] Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

链接: https://arxiv.org/abs/2605.14033
作者: David N. Olivieri,Roque J. Hernández
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Scientific theory shift in AI agents requires more than fitting equations to data. An artificial scientific agent must detect whether an existing representational framework remains transportable into a new regime, or whether its language has become locally-to-globally obstructed and must be extended. This paper develops a finite sheaf-theoretic framework for detecting theory-shift candidates through transport and obstruction. Contexts are organized as a local-to-global structure in which source, overlap, target, and validation charts are fitted, restricted, and tested for gluing. Obstruction measures failure of coherence through residual fit, overlap incompatibility, constraint violation, limiting-relation failure, and representational cost. We evaluate the framework on a controlled transition-card benchmark designed to separate deformation within a source language from extension of that language. The main result is direct obstruction ranking: the intended deformation or extension is usually the lowest-obstruction candidate, and transition type is separated in the benchmark. A constellation kernel over the same signatures is included only as a secondary representational-similarity probe. The aim is not to reconstruct historical paradigm shifts or solve open-ended autonomous theory invention, but to isolate a finite diagnostic subproblem for AI agents: detecting when representational transport fails and extension becomes the coherent next move.

[AI-136] R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning ICML2026

链接: https://arxiv.org/abs/2605.14026
作者: Sanghyeob Song,Donghyeok Lee,Jinsik Kim,Sungroh Yoon
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026). This is the camera-ready version

点击查看摘要

Abstract:For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL’s spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: this https URL

[AI-137] Measuring Google AI Overviews: Activation Source Quality Claim Fidelity and Publisher Impact

链接: https://arxiv.org/abs/2605.14021
作者: Haofei Xu,Umar Iqbal,Jacob M. Montgomery
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Under Review

点击查看摘要

Abstract:Google AI Overviews (AIOs) are arguably the most widely encountered deployment of generative AI, reaching over 2 billion users who may not realize the answers they see are AI-generated. Where search engines have traditionally surfaced ranked sources and left users to evaluate them, AIOs synthesize and deliver a single answer - giving Google unprecedented editorial control over what users read and know. We present a large-scale longitudinal measurement study, issuing 55,393 trending queries across 19 topical categories over a 40-day window (March 13 - April 21, 2026). We report four main findings. First, overall AIO activation is 13.7%, rising to 64.7% for question-form queries, while politically sensitive topics see markedly lower rates. Second, AIO-cited domains are more credible than co-displayed first-page results, yet nearly 30% do not appear in those results at all, indicating a source selection mechanism distinct from Google’s ranking algorithm. Third, decomposing responses into 98,020 atomic claims, 11.0% are unsupported by the cited pages - with omission the dominant failure mode - and source quality and claim fidelity are largely independent. Fourth, well over half of AIO-cited pages carry display advertising, meaning publishers lose revenue when AIOs suppress the click-through, even as Google’s own sponsored ads continue to appear on the same page. Together, these findings document a rapid transformation of the online information ecosystem whose consequences for epistemic security remain poorly understood.

[AI-138] Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signal

链接: https://arxiv.org/abs/2605.14014
作者: Tomoyoshi Kimura,Denizhan Kara,Jinyang Li,Hongjue Zhao,Yigong Hu,Yizhuo Chen,Xiaomin Ouyang,Shengzhong Liu,Tarek Abdelzaher
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Internet of Things (IoT) systems continuously collect heterogeneous sensing signals from ubiquitous sensors to support intelligent applications such as human activity analysis, emotion monitoring, and environmental perception. These signals are inherently non-stationary and multi-scale, posing unique challenges for standard tokenization techniques. This paper proposes Dywave, a dynamic tokenization framework for IoT sensing signals that constructs compact input representations aligned with intrinsic temporal structures and underlying physical events. Dywave leverages wavelet-based hierarchical decomposition, identifies meaningful temporal boundaries corresponding to underlying semantic events, and adaptively compresses redundant intervals while preserving temporal coherence. Extensive evaluations on five real-world IoT sensing datasets across activity recognition, stress assessment, and nearby object detection demonstrate that Dywave outperforms state-of-the-art methods by up to 12% in accuracy, while improving computational efficiency by reducing input token lengths by up to 75% across mainstream sequence models. Moreover, Dywave exhibits improved robustness to domain shifts and varying sequence lengths.

[AI-139] Conditional Attribute Estimation with Autoregressive Sequence Models

链接: https://arxiv.org/abs/2605.14004
作者: Erica Stutz,Giacomo Marino,Daniella Meeker,Qiao Liu,Andrew J. Loza
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local patterns during training, underfitting of global structure, and requires significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next-token probability and the value of an attribute conditional on each potential next token selection. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: (1) per-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute’s value; (2) counterfactual analysis, by quantifying attribute differences conditional on alternative next token choices; (3) steerable generation, by decoding sequences based on a combination of next-token and attribute likelihoods. Our approach achieves state of the art performance on sparse reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks.
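Steerable generation of the kind described can be sketched as a re-scoring rule (the token and attribute probabilities below are made up for illustration): each candidate next token's log-probability is combined with a weighted attribute log-likelihood before selection.

```python
# Attribute-guided token selection: score = log p(token) + w * log p(attr | token).
import math

def guided_pick(token_probs, attr_probs, weight):
    """token_probs, attr_probs: dicts token -> probability; returns argmax token."""
    scores = {t: math.log(token_probs[t]) + weight * math.log(attr_probs[t])
              for t in token_probs}
    return max(scores, key=scores.get)

token_probs = {"happy": 0.5, "sad": 0.3, "tired": 0.2}  # next-token distribution
attr_probs = {"happy": 0.1, "sad": 0.8, "tired": 0.4}   # P(attribute | token)
print(guided_pick(token_probs, attr_probs, weight=0.0))  # unguided → "happy"
print(guided_pick(token_probs, attr_probs, weight=1.0))  # steered → "sad"
```

Because both quantities come from a single forward pass in the proposed framework, this re-scoring needs no extra sampling at inference time.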

[AI-140] PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts ACL2026

链接: https://arxiv.org/abs/2605.14002
作者: Yifei Zhu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 7 figures, accepted at The 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

点击查看摘要

Abstract:Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long-context question answering into open-ended exploration. Yet real-world use requires models to discover and synthesize “long-tail” facts from dispersed sources, a capability that remains under-evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis via constructing political biographies for 400 global elites, covering over 10000 political facts. We standardize evaluation with an optimized multi-agent system and propose FactNet, an evidence-conditional protocol that scores discovery, fine-grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine-grained details, and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.

[AI-141] Towards Resource-Efficient LLMs: End-to-End Energy Accounting of Distillation Pipelines ICML2026

链接: https://arxiv.org/abs/2605.13981
作者: Katherine Lambert,Sasha Luccioni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to the 43rd International Conference on Machine Learning (ICML 2026). 11 pages, 6 figures

点击查看摘要

Abstract:The rise in deployment of large language models has driven a surge in GPU demand and datacenter scaling, raising concerns about electricity use, grid stress, and the impacts of modern AI workloads. Distillation is often promoted as one of the most effective paths to obtain cheaper, more efficient models, yet these claims rarely account for the full end-to-end energy and resource costs, including crucial teacher-side workloads such as data generation, logit caching, and evaluation. We present a comprehensive energy accounting framework that measures the complete computational cost of distillation pipelines via detailed stage-wise tracking of GPU device power consumption. In our experiments, we separate and log empirical energy use across distinct phases and systematically measure the energy and emissions of two common distillation methods: the classic logit-based knowledge distillation and synthetic-data supervised fine-tuning, constructing energy-quality Pareto frontiers that expose the previously ignored costs. From these measurements and analyses, we derive practical design rules for selecting distillation methods and hyperparameters under energy and budget constraints, and release an open-source measurement harness and accounting protocol to provide a standardized foundation for comparable, reproducible distillation research, explicitly accountable for complete pipeline energy impact.

[AI-142] WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

链接: https://arxiv.org/abs/2605.13959
作者: Sinjae Kang,Chanyoung Kim,Kaixin Wang,Li Zhao,Kimin Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Generative policies based on diffusion and flow matching have become a dominant paradigm for visuomotor robotic control. We show that replacing the standard Gaussian source distribution with WarmPrior, a simple temporally grounded prior constructed from readily available recent action history, consistently improves success rates on robotic manipulation tasks. We trace this gain to markedly straighter probability paths, echoing the effect of optimal-transport couplings in Rectified Flow. Beyond standard behavior cloning, WarmPrior also reshapes the exploration distribution in prior-space reinforcement learning, improving both sample efficiency and final performance. Collectively, these results identify the source distribution as an important and underexplored design axis in generative robot control.
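The source-distribution swap can be sketched in a few lines (the construction below, history mean plus small Gaussian noise, is one illustrative reading of a temporally grounded prior, not the paper's exact recipe): instead of sampling the flow-matching source from N(0, I), sampling starts near recent actions.

```python
# WarmPrior-style source sampling sketch: draw the flow source near the mean
# of recent actions rather than from a standard Gaussian.
import random

def warm_source(action_history, noise_scale, rng):
    """One source sample per action dimension: history mean + scaled noise."""
    dim = len(action_history[0])
    means = [sum(a[d] for a in action_history) / len(action_history) for d in range(dim)]
    return [m + noise_scale * rng.gauss(0.0, 1.0) for m in means]

rng = random.Random(0)
history = [[0.2, -0.1], [0.4, -0.3], [0.3, -0.2]]  # recent 2-D actions
sample = warm_source(history, noise_scale=0.05, rng=rng)
print(sample)  # close to the history mean [0.3, -0.2]
```

Starting the flow near the target action distribution is what straightens the probability paths, echoing optimal-transport couplings in Rectified Flow.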

[AI-143] Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

链接: https://arxiv.org/abs/2605.13950
作者: Darius A. Faroughy,Sofia Palacios Schweitzer,Ian Pang,Siddharth Mishra-Sharma,David Shih
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Experiment (hep-ex); High Energy Physics - Phenomenology (hep-ph)
备注: 23 pages | 9 figures | 4 tables | Code: this https URL | Task Corpus: this https URL

点击查看摘要

Abstract:Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.
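A continuous histogram-fidelity score of the kind the benchmark relies on can be sketched as histogram intersection (one plausible standard metric; the abstract does not name the exact metrics used): two normalized yield histograms score 1.0 when their shapes agree exactly.

```python
# Histogram intersection over predicted vs. published event yields:
# normalize both count lists, then sum the bin-wise minima.

def histogram_intersection(pred_counts, true_counts):
    sp, st = sum(pred_counts), sum(true_counts)
    return sum(min(p / sp, t / st) for p, t in zip(pred_counts, true_counts))

pred = [10, 20, 30]  # predicted event yields per signal region
true = [10, 25, 25]  # published yields
print(round(histogram_intersection(pred, true), 4))  # → 0.9167
```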

[AI-144] EvolveMem: Self-Evolving Memory Architecture via AutoResearch for LLM Agents

链接: https://arxiv.org/abs/2605.13941
作者: Jiaqi Liu,Xinyu Ye,Peng Xia,Zeyu Zheng,Cihang Xie,Mingyu Ding,Huaxiu Yao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Long-term memory is essential for LLM agents that operate across multiple sessions, yet existing memory systems treat retrieval infrastructure as fixed: stored content evolves while scoring functions, fusion strategies, and answer-generation policies remain frozen at deployment. We argue that truly adaptive memory requires co-evolution at two levels: the stored knowledge and the retrieval mechanism that queries it. We present EvolveMem, a self-evolving memory architecture that exposes its full retrieval configuration as a structured action space optimized by an LLM-powered diagnosis module. In each evolution round, the module reads per-question failure logs, identifies root causes, and proposes targeted configuration adjustments; a guarded meta-analyzer applies them with automatic revert-on-regression and explore-on-stagnation safeguards. This closed-loop self-evolution realizes an AutoResearch process: the system autonomously conducts iterative research cycles on its own architecture, replacing manual configuration tuning. Starting from a minimal baseline, the process converges autonomously, discovering effective retrieval strategies including entirely new configuration dimensions not present in the original action space. On LoCoMo, EvolveMem outperforms the strongest baseline by 25.7% relative and achieves a 78.0% relative improvement over the minimal baseline. On MemBench, EvolveMem exceeds the strongest baseline by 18.9% relative. Evolved configurations transfer across benchmarks with positive rather than catastrophic transfer, indicating that the self-evolution process captures universal retrieval principles rather than benchmark-specific heuristics. Code is available at this https URL.
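The revert-on-regression safeguard can be sketched as a guarded update (the evaluator and configuration format below are illustrative assumptions, not EvolveMem's API): a proposed configuration is kept only if it does not regress the evaluation score.

```python
# Guarded configuration update: apply a proposed config, keep it on improvement,
# revert to the previous config on regression.

def guarded_update(current_cfg, proposed_cfg, evaluate):
    """Returns (config to keep, its score)."""
    baseline = evaluate(current_cfg)
    candidate = evaluate(proposed_cfg)
    if candidate >= baseline:
        return proposed_cfg, candidate
    return current_cfg, baseline  # revert on regression

# Toy evaluator: prefers a retrieval top_k near 8.
score = lambda cfg: -abs(cfg["top_k"] - 8)
cfg, _ = guarded_update({"top_k": 4}, {"top_k": 6}, score)   # improvement: kept
print(cfg)  # → {'top_k': 6}
cfg, _ = guarded_update(cfg, {"top_k": 16}, score)           # regression: reverted
print(cfg)  # → {'top_k': 6}
```

In the full system the proposal step is driven by the LLM diagnosis module reading per-question failure logs, with an additional explore-on-stagnation rule.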

[AI-145] AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

链接: https://arxiv.org/abs/2605.13940
作者: Haomin Zhuang,Hanwen Xing,Yujun Zhou,Yuchen Ma,Yue Huang,Yili Shen,Yufei Han,Xiangliang Zhang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Third-party skills are becoming the package ecosystem for LLM agents. They package natural-language instructions, helper scripts, templates, documents, and service configuration into reusable workflows. This makes skills useful, but it also introduces a new security problem: a malicious skill does not need to ask the model to perform an obviously harmful action. Instead, it can disguise the harmful behavior as part of a routine workflow, relying on the agent to execute that workflow with high-value permissions and limited human supervision. We introduce AgentTrap, a dynamic benchmark for evaluating whether LLM agents can use third-party skills while resisting malicious runtime behavior. AgentTrap contains 141 tasks: 91 malicious tasks and 50 benign utility tasks, covering 16 security-impact dimensions grounded in agent-skill supply-chain threats. In each task, the agent receives an ordinary user request, runs with installed skills that may contain malicious workflow elements, and is executed in a sandboxed environment. AgentTrap then judges complete trajectories for attack success, blocked or refused behavior, attack-not-triggered cases, and no-attack-evidence outcomes. Our central finding is that the most informative failures are not simple jailbreaks. Models often complete the visible user task while treating unsafe side effects introduced by the skill as part of the normal workflow. This motivates runtime evaluation of the concrete model–framework–workspace environment in which users actually delegate work. Code and data are available at this https URL and this https URL.

[AI-146] Towards the Next Frontier of LLMs Training on Private Data: A Cross-Domain Benchmark for Federated Fine-Tuning

链接: https://arxiv.org/abs/2605.13936
作者: Daniel M. Jimenez-Gutierrez,Enrique Zuazua,Georgios Kellaris,Joaquin del Rio,Oleksii Sliusarenko,Xabi Uribe-Etxebarria
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:The recent success of large language models (LLMs) has been largely driven by vast public datasets. However, the next frontier for LLM development lies beyond public data. Much of the world’s most valuable information is private, especially in highly regulated sectors such as healthcare and finance, where data include patient histories or customer communications. Unlocking this data could represent a major leap forward, enabling LLMs with deeper domain expertise and stronger real-world utility. Yet, these data cannot be shared because they are distributed across institutions and constrained by privacy, regulatory, and organizational barriers. Moreover, institutional datasets are typically non-independent and identically distributed (non-IID), differing across sites in population characteristics, data modalities, documentation patterns, and task-specific label distributions. In this paper, we demonstrate a practical approach to unlocking private and distributed institutional data for LLM adaptation through federated collaboration across data silos. Built on the this http URL Federated Learning platform, our framework enables nodes to jointly fine-tune a shared LLM without exchanging private data. We evaluate this approach through a cross-domain benchmark in healthcare and finance, using four closed-ended question answering and classification datasets: MedQA, MedMCQA, FPB, and FiQA-SA. We compare three parameter-efficient fine-tuning (PEFT) strategies-LoRA, QLoRA, and IA3-across pretrained backbones under non-IID settings reflecting institutional data heterogeneity. Our results show that federated fine-tuning performs close to centralized training and outperforms isolated single-institution learning. From a Green AI perspective, QLoRA and IA3 improve efficiency with limited accuracy degradation, supporting federated PEFT as a viable approach for adapting LLMs where data cannot be shared. 
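The federated aggregation underlying this setup can be sketched as weighted FedAvg over adapter parameters (adapters are flattened to plain lists for illustration; real LoRA/QLoRA/IA3 updates are tensors): each silo trains its adapter locally, and only the adapter parameters, never the raw data, are averaged on the server.

```python
# FedAvg sketch for parameter-efficient federated fine-tuning: average each
# client's adapter parameters, weighted by local dataset size.

def fedavg(client_adapters, client_sizes):
    """Weighted average of per-client adapter parameter lists."""
    total = sum(client_sizes)
    dim = len(client_adapters[0])
    return [sum(w[d] * n for w, n in zip(client_adapters, client_sizes)) / total
            for d in range(dim)]

adapters = [[1.0, 2.0], [3.0, 4.0]]  # two silos' local adapter updates
sizes = [100, 300]                   # local dataset sizes
print(fedavg(adapters, sizes))       # → [2.5, 3.5]
```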

[AI-147] Unsupervised learning of acquisition variability in structural connectomes via hybrid latent space modeling

链接: https://arxiv.org/abs/2605.13933
作者: Gaurav Rudravaram,Lianrui Zuo,Karthik Ramadass,Elyssa McMaster,Jongyeon Yoon,Aravind R. Krishnan,Adam M. Saunders,Chenyu Gao,Nancy R. Newlin,Praitayini Kanakaraj,Lori L. Beason Held,Murat Bilgel,Laura A. Barquero,Micah DArchangel,Tin Q. Nguyen,Laurie B. Cutting,Derek Archer,Timothy J. Hohman,Daniel C. Moyer,Bennett A. Landman
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:

点击查看摘要

Abstract:Acquisition differences across sites, scanners, and protocols in dMRI introduce variability that complicates structural connectome analysis. This motivates deep learning models that can represent high-dimensional connectomes in a low-dimensional space while explicitly separating acquisition-related effects from biological variation. Conventional dimensionality reduction methods model all variance as continuous, so acquisition effects often get absorbed into a continuous latent space. Recent hybrid latent-space models combine discrete and continuous components to address this, but typically require manual capacity tuning to ensure the discrete component captures the intended variability. We introduce an unsupervised framework that removes this manual tuning by architecturally annealing encoder outputs before decoding, allowing the model to adaptively balance discrete and continuous latent variables during training. To evaluate it, we curated a dataset of N=7,416 structural connectomes derived from dMRI, spanning ages 2 to 102 and 13 studies with 25 unique acquisition-parameter combinations. Of these, 5,900 are cognitively unimpaired, 877 have mild cognitive impairment (MCI), and 639 have Alzheimer’s disease (AD). We compare against a standard VAE, PCA with k-means clustering, and hybrid models that anneal only through the loss function. Our architectural annealing produces stronger site learning (ARI = 0.53, p < 0.05) than these baselines. Results show that a hybrid continuous-discrete latent space, with architectural rather than loss-based annealing, provides a useful unsupervised mechanism for capturing acquisition variability in dMRI: by jointly modeling smooth and categorical structure, the Joint-VAE recovers clusters aligned with scanner and protocol differences.

[AI-148] Terms-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

链接: https://arxiv.org/abs/2605.13909
作者: Erica Zhang,Fangzhao Zhang,Aneesh Pappu,Batu El,Jose Blanchet,Susan Athey,Jiashuo Liu,James Zou
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注: Project Site: this https URL

点击查看摘要

Abstract:Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart’s latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart’s private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

[AI-149] A Non-Destructive Methodological Framework for Modernizing Legacy Clinical Reporting Systems for AI-Driven Pharmacoinformatics: A SAS Case Study

链接: https://arxiv.org/abs/2605.13905
作者: Jaime Yan
机构: 未知
类目: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 29 pages, 7 figures, 5 tables

点击查看摘要

Abstract:Drug development and pharmacovigilance are frequently bottlenecked by legacy clinical reporting pipelines. These monolithic systems encode regulatory-grade logic but resist AI integration by producing opaque output with no machine-readable intermediate layer. Existing modernization approaches force a choice between full rewrites and incremental refactoring that preserves structural barriers. We present a non-destructive methodological framework achieving AI-driven pharmacoinformatics readiness without altering legacy source code. A metadata layer–comprising a bridge map, a typed Intermediate Representation (IR), and an orchestrator–wraps existing components and re-exposes their outputs as structured data consumable by LLMs. It enables optional incremental consolidation, replacing selected legacy components with metadata-configured core routines while the remainder operates unchanged. Validated on a 558-component SAS reporting library (373,000 lines of code), the framework demonstrated immediate AI-readiness under coexistence mode, yielding machine-readable output. Where consolidation was elected, the modernized core achieved a 92% reduction in proprietary code. Parity validation on 14 report types from a Phase III study achieved cell-level parity of 80% or above on 11 reports (mean 82.7%, best 99.2%). A benchmark using CDISC CDISCPilot01 data achieved 100% parity across 5 reports. LLM experiments confirmed the IR enables automated pharmacovigilance, table summarization, and trial configuration generation. The framework offers a regulation-aware path to AI-integrated clinical reporting, accelerating drug development without interrupting regulatory submissions.

[AI-150] Breaking Global Self-Attention Bottlenecks in Transformer-based Spiking Neural Networks with Local Structure-Aware Self-Attention

链接: https://arxiv.org/abs/2605.13887
作者: Lingdong Li,Hangming Zhang,Qiang Yu
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Transformer-based Spiking Neural Networks (SNNs) integrate SNNs with global self-attention and have demonstrated impressive performance. However, existing Transformer-based SNNs suffer from two fundamental limitations. First, they typically employ max pooling layers to reduce the size of feature maps, but the max pooling captures only the strongest response and fails to comprehensively preserve representative regional features. Second, the global self-attention involves all global feature interactions, resulting in computational redundancy and quadratic computational complexity, thus conflicting with the sparse and energy-efficient characteristics of SNNs. To address these challenges, we develop Local Structure-Aware Spiking Transformer (LSFormer), a novel Transformer-based Spiking Neural Network that incorporates Spiking Response Pooling (SPooling) and Local Structure-Aware Spiking Self-Attention (LS-SSA). For the first time, our LSFormer leverages a local dilated window mechanism to capture both local details and long-range dependencies. Experimental results demonstrate that our LSFormer achieves state-of-the-art performance compared to existing advanced Transformer-based SNNs. Notably, on the more challenging static dataset Tiny-ImageNet and neuromorphic dataset N-CALTECH101, LSFormer substantially outperforms state-of-the-art baselines by 4.3% and 8.6% in top-1 classification accuracy, respectively. These results highlight the potential of LSFormer to advance energy-efficient spiking models toward practical deployment in large-scale vision applications.
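The local dilated window idea can be illustrated in one dimension as an attention mask that unions a dense local band with dilated longer-range positions (a schematic sketch; LSFormer's actual LS-SSA operates on 2D spiking feature maps):

```python
import numpy as np

def dilated_window_mask(n, window=2, dilation=3):
    # Boolean attention mask combining a dense local window with dilated
    # longer-range positions (schematic version of a "local dilated window";
    # not LSFormer's actual 2D mechanism).
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    local = dist <= window                                  # dense local band
    dilated = (dist % dilation == 0) & (dist <= window * dilation)
    return local | dilated
```

Each position attends to only O(window) neighbors instead of all n positions, which is the source of the sub-quadratic cost the abstract targets.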

[AI-151] ARES-LSHADE: Autoresearch-Enhanced LSHADE with Memetic Polish for the GNBG Benchmark

链接: https://arxiv.org/abs/2605.13877
作者: Abdullah Naeem,Md Wasi Ul kabir,Manish Bhatt,Ayon Dey,Anav Katwal,Md Tamjidul Hoque
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner, contributing two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 of 744 wins (per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG’s compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with operator-only edit surface and fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark’s compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition’s blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at this https URL.
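As context for the LSHADE base that ARES-LSHADE builds on, the textbook DE/current-to-pbest/1 mutation used by the (L)SHADE family can be sketched as follows (the scout-augmented operator described above adds further components not shown here):

```python
import numpy as np

def current_to_pbest_mutation(pop, fitness, F=0.5, p=0.2, rng=None):
    # Textbook DE/current-to-pbest/1 mutation (minimization): each vector
    # moves toward a random member of the best p-fraction plus a scaled
    # difference of two distinct other members.
    rng = rng if rng is not None else np.random.default_rng(0)
    n, _ = pop.shape
    order = np.argsort(fitness)              # best individuals first
    n_best = max(1, int(p * n))
    mutants = np.empty_like(pop)
    for i in range(n):
        pbest = pop[rng.choice(order[:n_best])]
        r1, r2 = rng.choice([j for j in range(n) if j != i], 2, replace=False)
        mutants[i] = pop[i] + F * (pbest - pop[i]) + F * (pop[r1] - pop[r2])
    return mutants
```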

[AI-152] GEAR: Genetic AutoResearch for Agentic Code Evolution

链接: https://arxiv.org/abs/2605.13874
作者: Ahmadreza Jeddi,Minh Ngoc Le,Hakki C. Karaimer,Konstantinos G. Derpanis,Babak Taati
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous research agents can already run machine learning experiments without human supervision, but many rely on a narrow search strategy: they repeatedly modify one program and keep changes only when they improve the current best result. This can cause them to discard useful partial ideas, alternative promising directions, and insights from failed or incomplete experiments. GEAR, or Genetic AutoResearch, replaces this single-path search with a population-based search over multiple research states. It keeps a set of strong candidate solutions, selects parents based on productivity, novelty, and coverage, and explores new ideas through mutation and crossover. Each research state stores its code changes, reflections, and performance data, allowing future decisions to build on past discoveries. The paper studies three versions of GEAR: one controlled through prompting, one using a fixed programmatic search controller, and one where the controller itself can evolve during the run. Under the same compute budget and environment, all three versions outperform the AutoResearch baseline. More importantly, while the baseline tends to settle into one local optimum, GEAR continues finding improvements over longer runs. Overall, the results suggest that autonomous research agents become more effective when they maintain multiple promising directions and can adapt their search strategy over time.

[AI-153] S-AI-Recursive: A Bio-Inspired and Temporal Sparse AI Architecture for Iterative, Introspective and Energy-Frugal Reasoning

链接: https://arxiv.org/abs/2605.13872
作者: Said Slaoui
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Preprint. 51 pages. No figures. S-AI-Recursive: A bio-inspired sparse AI architecture for iterative, introspective, and energy-efficient reasoning

点击查看摘要

Abstract:This article introduces S-AI-Recursive, a bio-inspired Sparse Artificial Intelligence architecture in which reasoning is operationalized as a hormonal closed-loop iteration rather than a single feed-forward pass. Building upon the S-AI foundational framework [1], the hormonal-probabilistic unification doctrine [2], and the formal mathematical methodology established in S-AI-IoT [3], the present work formalizes the Recursive Reasoning Cycle (RRC) as a dynamical system governed by two novel hormones: Clarifine, a convergence signal, and Confusionin, an uncertainty detector, whose antagonistic regulation drives iterative state refinement toward a stable cognitive equilibrium. The complete mathematical framework is developed, including recursive state dynamics, Lyapunov stability proof, entropic contraction theorem, hormonal stopping criterion with finite-time termination guarantee, Euler-Maruyama discretization with projection, primal-dual agent selection under iteration budget, and recursive engram memory with warm-start acceleration. Experimental validation on the SAI-UT+ testbench demonstrates that S-AI-Recursive achieves competitive reasoning performance on abstract and symbolic benchmarks with fewer than ten million parameters, confirming the central principle of temporal parsimony: iterative cognitive depth substitutes for architectural width.

[AI-154] Spectral Analysis of Fake News Propagation

链接: https://arxiv.org/abs/2605.13861
作者: Weibin Cai,Reza Zafarani
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The propagation structure of fake news has been shown to be an important cue for detecting it; yet, existing propagation-based fake news detection methods have mainly relied on ad hoc topological features, and a unified view of cascade patterns is still lacking. To address this, we study news propagation from a spectral view by connecting graph spectra to propagation-related structural properties through rigorous spectral bounds. In particular, we introduce several new bounds and integrate them with existing ones into a unified spectral representation of information propagation. We then use these spectral bounds for downstream classification and design a discrete structural optimization framework to interpret learned propagation patterns. For efficient optimization, we rely on a first-order perturbation approximation and consider both score-guided and bound-guided objectives. Experiments on real-world data reveal meaningful spectral differences between fake and real news, competitive classification performance from spectral bounds, and interpretable evolution trajectories from structural optimization. The findings demonstrate the value of spectral analysis for understanding and modeling news propagation.
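The general idea of connecting graph spectra to propagation structure can be illustrated by extracting simple spectral summaries from a cascade graph (the paper's specific bounds and discrete optimization framework go well beyond this sketch):

```python
import numpy as np

def cascade_spectral_features(edges, n):
    # Adjacency and Laplacian spectra of an undirected propagation cascade,
    # with simple summaries tied to classical structural bounds (e.g., the
    # spectral radius is at most the maximum degree). Illustrative only.
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                     # combinatorial Laplacian
    eig_A = np.linalg.eigvalsh(A)            # eigenvalues, ascending order
    eig_L = np.linalg.eigvalsh(L)
    return {
        "spectral_radius": eig_A[-1],
        "algebraic_connectivity": eig_L[1],  # Fiedler value
        "max_degree": deg.max(),
    }
```

For a star-shaped cascade (one source retweeted by many leaves) the spectral radius grows as the square root of the fan-out, while a deep chain keeps it near 2, so these summaries separate broadcast-like from chain-like propagation.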

[AI-155] The Moltbook Observatory Archive: an incremental dataset of agent-only social network activity

链接: https://arxiv.org/abs/2605.13860
作者: Sushant Gautam,Annika W. Olstad,Klas H. Pettersen,Michael A. Riegler
机构: 未知
类目: Social and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 5 figures

点击查看摘要

Abstract:Moltbook is a social media platform in which posts and comments are authored exclusively by autonomous AI agents. We present the Moltbook Observatory Archive, an incremental dataset that passively records agent profiles, posts, comments, community metadata ("submolts"), platform-level time-series snapshots, and word-frequency trend aggregates obtained by continuously polling the Moltbook API. Data are stored in a live SQLite observatory database and exported as date-partitioned Parquet files to enable efficient analysis and reproducible research. The documented release covers 78 days of platform activity (2026-01-27 to 2026-04-14) and contains 2,615,098 posts and 1,213,007 comments from 175,886 unique posting agents across 6,730 communities. This is, to our knowledge, the first large-scale observational dataset of a social network populated exclusively by autonomous AI agents. The archive is intended to support research on multi-agent communication, emergent social behavior, and safety-relevant phenomena in agent-only online environments, and it is released under the MIT license with code for collection and export.
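A live-SQLite-plus-aggregation workflow of the kind described can be sketched with the standard library; the table and column names below are hypothetical, not the archive's actual schema:

```python
import sqlite3

# Minimal observatory-style schema plus a daily-activity query, mirroring the
# "live SQLite database" workflow described above. Table and column names are
# hypothetical; the archive's real schema may differ.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE posts (
    post_id      INTEGER PRIMARY KEY,
    agent_id     TEXT,
    submolt      TEXT,
    created_date TEXT,
    body         TEXT)""")
con.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?)",
    [(1, "agent-1", "m/general", "2026-01-27", "hello"),
     (2, "agent-2", "m/general", "2026-01-27", "hi"),
     (3, "agent-1", "m/science", "2026-01-28", "results")])
# Platform-level time-series snapshot: posts per day.
rows = con.execute(
    "SELECT created_date, COUNT(*) FROM posts "
    "GROUP BY created_date ORDER BY created_date").fetchall()
```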

[AI-156] BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

链接: https://arxiv.org/abs/2605.13859
作者: Sihang Guo,Chenlin Zhou,Jiaqi Wang,Kehai Chen,Qingyan Meng,Zhengyu Ma
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer promising energy-efficient alternatives to large language models (LLMs) due to their event-driven nature and ultra-low power consumption. However, to preserve capacity, most existing spiking LLMs still incur intensive floating-point matrix multiplication (MatMul) and nonlinearities, or training difficulties arising from the complex spatiotemporal dynamics. To address these challenges, we propose BiSpikCLM, the first fully binary spiking MatMul-free causal language model. BiSpikCLM introduces Softmax-Free Spiking Attention (SFSA), eliminating softmax and floating-point operations in autoregressive language modeling. For efficient training, we introduce Spike-Aware Alignment Distillation (SpAD), which aligns ANN teacher and SNN student across embeddings, attention maps, intermediate features, and output logits. SpAD framework allows BiSpikCLM to reach comparable performance to ANN counterparts using substantially fewer training tokens (e.g., only 5.6% of the tokens for the 1.3B model). As a result, BiSpikCLM achieves competitive performance at only 4.16% - 5.87% of the computational cost on natural language generation tasks. Our results highlight the feasibility and effectiveness of fully binary spike-driven LLMs and establish the distillation as a promising pathway for brain-inspired spiking NLP.
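The softmax-free idea, replacing softmax attention with products of binary spike matrices plus a fixed scaling, can be sketched as follows (schematic only; BiSpikCLM's SFSA and its spatiotemporal spiking dynamics are more involved):

```python
import numpy as np

def softmax_free_spiking_attention(x, wq, wk, wv, theta=0.5):
    # Schematic spike-driven attention: no softmax and no floating-point
    # nonlinearity on the attention path (not BiSpikCLM's exact SFSA).
    spike = lambda z: (z > theta).astype(np.float32)   # Heaviside firing
    q, k, v = spike(x @ wq), spike(x @ wk), spike(x @ wv)
    n = x.shape[0]
    scores = (q @ k.T) * np.tril(np.ones((n, n)))      # causal spike overlaps
    return (scores @ v) / max(k.shape[1], 1)           # fixed scale, no softmax
```

Because q, k, and v are binary, the score matrix counts spike co-occurrences, so the whole attention path reduces to integer accumulations, which is what makes the computation event-driven and MatMul-free in spirit.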

[AI-157] Mixed Integer Goal Programming for Personalized Meal Optimization with User-Defined Serving Granularity

链接: https://arxiv.org/abs/2605.13849
作者: Francisco Aguilera Moreno
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 34 pages, 6 figures, open-source implementation

点击查看摘要

Abstract:Determining what to eat to satisfy nutritional requirements is one of the oldest optimization problems in operations research, yet existing formulations have two persistent limitations: continuous variables produce impractical fractional servings (1.7 eggs, 0.37 bananas), and hard nutrient constraints cause infeasibility when targets conflict. A systematic review of 56 diet optimization papers found that none combine integer programming with goal programming to address both issues. We propose Mixed Integer Goal Programming (MIGP) for personalized meal optimization. The formulation uses integer variables for practical serving counts and goal programming deviations for soft nutrient targets, with inverse-target normalization to balance multi-nutrient optimization. Per-food serving granularity allows natural units (one egg, one tablespoon of oil) without post-hoc rounding. We characterize the integrality gap in the goal programming context and identify a deviation absorption property: GP deviation variables buffer the cost of requiring integer servings, making the gap structurally smaller than in hard-constraint MIP. For meals with 15+ foods, the integer solution matches the continuous optimum in every benchmark instance. A computational evaluation across 810 instances (30 USDA foods, 9 configurations, 3 methods) shows MIGP finds strictly better solutions than GP with post-hoc rounding in 66% of cases (never worse) while maintaining 100% feasibility; hard-constraint IP achieves only 48%. Solve times stay under 100 ms for typical meal sizes using the open-source HiGHS solver. The implementation is available as an open-source Python module integrated into an interactive meal planning application.
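The MIGP structure described above, integer serving counts plus inverse-target-normalized goal deviations with soft equality targets, can be sketched with SciPy's HiGHS-backed `milp`; the three-food, two-nutrient data here are made up for illustration:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Toy instance (hypothetical data): 3 foods x 2 nutrient goals.
# Rows of A: protein (g) and calories per serving of each food.
A = np.array([[6.0, 1.0, 0.0],
              [70.0, 105.0, 120.0]])
targets = np.array([30.0, 600.0])
n_nutr, n_food = A.shape

# Decision vector: [servings x (integer), overshoot d+, undershoot d-];
# only the deviations are penalized, with inverse-target normalization.
c = np.concatenate([np.zeros(n_food), 1.0 / targets, 1.0 / targets])

# Goal constraints: A @ x - d+ + d- == targets (soft nutrient targets).
goal = LinearConstraint(
    np.hstack([A, -np.eye(n_nutr), np.eye(n_nutr)]), targets, targets)

integrality = np.concatenate([np.ones(n_food), np.zeros(2 * n_nutr)])
upper = np.concatenate([np.full(n_food, 10.0), np.full(2 * n_nutr, np.inf)])
res = milp(c=c, constraints=[goal], integrality=integrality,
           bounds=Bounds(0.0, upper))
servings = np.round(res.x[:n_food]).astype(int)  # whole servings by construction
```

Because the deviation variables absorb any residual mismatch, the model stays feasible even when nutrient targets conflict, which is the failure mode of hard-constraint formulations the abstract describes.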

[AI-158] Capacitive Touchscreens at Risk: Recovering Handwritten Trajectory on Smartphone via Electromagnetic Emanations

链接: https://arxiv.org/abs/2512.11484
作者: Yukun Cheng,Shiyu Zhu,Changhai Ou,Xingshuo Han,Yuan Li,Shihui Zheng
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper reveals and exploits a critical security vulnerability: the electromagnetic (EM) side channel of capacitive touchscreens leaks sufficient information to recover fine-grained, continuous handwriting trajectories. We present Touchscreen Electromagnetic Side-channel Leakage Attack (TESLA), a non-contact attack framework that captures EM signals generated during on-screen writing and regresses them into two-dimensional (2D) handwriting trajectories in real time. Extensive evaluations across a variety of commercial off-the-shelf (COTS) smartphones show that TESLA achieves 77% character recognition accuracy and a Jaccard index of 0.74, demonstrating its capability to recover highly recognizable motion trajectories that closely resemble the original handwriting under realistic attack conditions.

[AI-159] Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

链接: https://arxiv.org/abs/2605.14791
作者: Licong Xu,Thomas Borrett
机构: 未知
类目: Instrumentation and Methods for Astrophysics (astro-ph.IM); Cosmology and Nongalactic Astrophysics (astro-ph.CO); Artificial Intelligence (cs.AI)
备注: 4 pages, 2 figures, Contribution to the 2026 Cosmology session of the 60th Rencontres de Moriond

点击查看摘要

Abstract:Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two complementary agentic systems for cosmology: CMBEvolve, which targets tasks with explicit quantitative objectives through LLM-guided code evolution and tree search, and CosmoEvolve, which targets open-ended scientific workflows through a virtual multi-agent research laboratory. As preliminary demonstrations, we apply CMBEvolve to out-of-distribution detection in weak-lensing maps, where it iteratively improves the benchmark score through code evolution, and CosmoEvolve to autonomous ACT DR6 data analysis, where it identifies non-trivial pair- and scale-dependent behaviour and produces analysis-grade diagnostics. These examples show how cosmology can provide both controlled benchmark tasks and realistic open-ended research problems for the development of AI scientist systems.

[AI-160] Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications

链接: https://arxiv.org/abs/2605.14671
作者: Matteo Cobelli,Stefano Sanvito
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autoresearch offers a flexible paradigm for automating scientific tasks, in which an AI agent proposes, implements, evaluates, and refines candidate solutions against a quantitative objective. Here, we use composition-based materials-property prediction to test whether such agents can perform a task beyond model selection and hyperparameter optimization: the design of input descriptors. We introduce Automat, an autoresearch framework where a coding agent based on a large language model generates composition-only descriptors for chemical compounds and evaluates them using a random forest workflow. The agent is restricted to information derivable from chemical formulas and iteratively proposes, implements, and tests chemically motivated descriptor strategies. We apply Automat, with OpenAI Codex using GPT-5.5 as the coding agent, to the prediction of experimental band gaps in inorganic materials and Curie temperatures in ferromagnetic compounds. In both tasks, Automat improves over fractional-composition, Magpie, and combined fractional-composition/Magpie baselines, while producing descriptor families that are chemically interpretable. These results provide a demonstration that autoresearch agents can generate competitive, task-specific materials descriptors without manual feature engineering during the run. They also reveal current limitations, including descriptor redundancy, sensitivity to greedy feature expansion, and the need for explicit complexity control, descriptor pruning, and more sophisticated search strategies.
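The fractional-composition baseline mentioned above, the descriptor family the agent must beat using only the chemical formula, can be computed in a few lines (a simplified parser that handles element symbols with optional counts but not nested parentheses):

```python
import re

def fractional_composition(formula):
    # Fractional-composition descriptor from a chemical formula: the mole
    # fraction of each element. Simplified parser; no nested parentheses.
    counts = {}
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[element] = counts.get(element, 0.0) + (float(num) if num else 1.0)
    total = sum(counts.values())
    return {el: c / total for el, c in counts.items()}
```

For example, `fractional_composition("Fe2O3")` assigns Fe a fraction of 0.4 and O a fraction of 0.6; a descriptor vector over a fixed element alphabet is then fed to the random forest.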

[AI-161] A plug-and-play generative framework for multi-satellite precipitation estimation

链接: https://arxiv.org/abs/2605.14426
作者: Yunfan Yang,Haofei Sun,Xiuyu Sun,Wei Han,Xiaoze Xu,Xingtao Song,Jun Li,Zhiqiu Gao,Wei Huang,Hao Li
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reliable precipitation monitoring is essential for disaster risk reduction, water resources management, and agricultural decision-making. Multi-source satellite observations, particularly the combination of geostationary infrared and passive microwave measurements, have become a primary means of precipitation detection. Traditional multi-source satellite precipitation estimation methods remain computationally inefficient, and many deep learning methods lack the flexibility to incorporate new sensors without retraining the full model. Here we introduce PRISMA (Precipitation Inference from Satellite Modalities via generAtive modeling), a plug-and-play latent generative framework for multi-sensor precipitation estimation. PRISMA learns an unconditional precipitation prior from IMERG Final fields and constrains it through independently trained, sensor-specific conditional branches, allowing new observation sources to be incorporated without retraining the generative backbone. Applied to FY-4B AGRI infrared and GPM GMI microwave observations, PRISMA improves Critical Success Index by up to 40.3% and reduces root-mean-square error by 22.6% relative to infrared-only estimation within microwave swaths, while also improving probabilistic skill and maintaining an average inference time of about 37 s. Independent rain-gauge validation across China confirms consistent gains, and typhoon case studies show that microwave conditioning restores eyewall and spiral rainband structures, reducing storm-core mean absolute error by up to 42.3%. PRISMA thus provides an extensible and efficient framework for multi-sensor precipitation estimation.

[AI-162] Deciphering Neural Reparameterized Full-Waveform Inversion with Neural Sensitivity Kernel and Wave Tangent Kernel

链接: https://arxiv.org/abs/2605.14370
作者: Ruihua Chen,Yisi Luo,Bangyu Wu,Xile Zhao,Deyu Meng
机构: 未知
类目: Geophysics (physics.geo-ph); Artificial Intelligence (cs.AI); Computational Physics (physics.comp-ph)
备注:

点击查看摘要

Abstract:Full-waveform inversion (FWI) estimates unknown parameters in the wave equation from limited boundary measurements. Recent advances in neural reparameterized FWI (NeurFWI) demonstrate that representing the parameters using a neural network can reduce the reliance on the high-quality initial model and wavefield data, at the cost of slow high-resolution convergence. However, its underlying theoretical mechanism remains unclear. In this study, we establish the neural sensitivity kernel (NSK) and the wave tangent kernel (WTK) to analyze their convergence behavior from both model and data domains. These theoretical frameworks show that the neural tangent kernel (NTK) induced by neural representation adaptively modulates the original sensitivity and wave tangent kernels. This modulation leads to several key outcomes, i.e., the spectral filtering effect, the gradient wavenumber modulation, and the wave frequency bias, connecting the convergence behavior of NeurFWI with the eigen-structures of NSK and WTK. Building on these insights, we propose several enhanced NeurFWI methods with tailored eigen-structures in NSK and WTK to improve inversion performances and efficiency. We numerically validate these theoretical claims and the proposed methods in seismic exploration, and firstly extend their application to medical imaging.

[AI-163] Analog RF Computing: A New Paradigm for Energy-Efficient Edge AI Over MU-MIMO Systems

链接: https://arxiv.org/abs/2605.14331
作者: Wentao Yu,Vincent W.S. Wong
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Theory (cs.IT); Machine Learning (cs.LG)
备注: 13 pages, 6 figures, 2 tables. This paper proposes analog RF computing as a new paradigm for energy-efficient edge inference over wireless networks and studies the corresponding physical layer design framework

点击查看摘要

Abstract:Modern edge devices increasingly rely on neural networks for intelligent applications. However, conventional digital computing-based edge inference requires substantial memory and energy consumption. In analog radio frequency (RF) computing, a base station (BS) encodes the weights of the neural networks and broadcasts the RF waveforms to the clients. Each client reuses its passive mixer to multiply the received weight-encoded waveform with a locally generated input-encoded waveform. This enables wireless receivers to perform the matrix-vector multiplications (MVMs) that account for most of the computation burden in edge inference with ultra-low energy consumption. Unlike conventional downlink transmissions which are optimized for communications, analog RF computing requires a computing-centric physical layer that controls both the analog MVM accuracy and the energy consumption for inference. Motivated by this, in this paper, we propose a physical layer design framework for analog RF computing in MU-MIMO wireless systems. We derive tractable models for computing accuracy and energy consumption for inference, formulate a joint BS beamforming and client-side scaling problem subject to computing accuracy, transmit power, and hardware constraints, and develop a low-complexity algorithm to solve the non-convex problem. The proposed design provides client- and layer-specific accuracy control for both uniform- and mixed-precision inference. Simulations under 3GPP specifications show that analog RF computing can significantly reduce client-side energy consumption by nearly two orders of magnitude compared to digital computing, while mixed-precision inference requires even lower energy consumption than uniform-precision inference. Overall, these results establish analog RF computing over wireless networks as a promising paradigm for energy-efficient edge inference.

[AI-164] Do Language Models Align with Brains? Prediction Scores Are Not Enough

链接: https://arxiv.org/abs/2605.14025
作者: Xiao Jia
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: 39 pages, 4 main figures, 6 supplementary figures

点击查看摘要

Abstract:Brain-language model comparisons often interpret neural prediction scores as evidence that model representations capture brain-relevant language computation. We asked whether language models align with brains, and whether prediction scores are enough to support that claim, using L-PACT, a source-audited framework that evaluates predictive, relational, mechanism-stripping, and reliability-bounded evidence. Across primary naturalistic language neural datasets and derived language-model representations, L-PACT compared real model features with nuisance baselines and severe controls, tested whether model-to-brain profiles reproduced brain-to-brain patterns, recomputed held-out scores after mechanism stripping, and normalized evidence against brain-brain ceilings. The locked analysis set contains 414 predictive-control rows, 2304 relational profile rows, 4320 mechanism-stripping rows, 420 brain-brain ceiling rows, and 146 integrated decision rows. Assay-sensitivity checks showed that brain-brain reliability, brain-as-model run-to-run relational profiles, independent low-level neural and WAV-derived acoustic-envelope gates, and a deterministic implanted-signal simulation can produce positive evidence when expected. Nevertheless, no real model row passed the predictive, relational, mechanism-stripping, or operational Turing-bounded reliability gates; all 146 integrated rows were control-explained. Less stringent single-criterion rules would have counted raw positive predictive, relational, stripping-delta, and ceiling-normalized effects, but L-PACT downgraded them because controls explained the apparent evidence. In the analyzed derived artifact set, the tested language-model representations do not satisfy L-PACT alignment gates; apparent positives are converted into an auditable control-explained taxonomy rather than treated as structural alignment.

[AI-165] A Regret Perspective on Online Multiple Testing

链接: https://arxiv.org/abs/2605.13916
作者: Qingyang Hao,Kongchang Zhou,Fang Kong,Hongxin Wei
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Online Multiple Testing (OMT), a fundamental pillar of sequential statistical inference, traditionally evaluates the False Discovery Rate (FDR) and statistical power in isolation, obscuring the highly asymmetric costs of false positives and false negatives in modern automated pipelines. To unify this evaluation, we introduce Weighted Regret. Under this metric, we prove the Duality of Regret Conservation: purely deterministic procedures ensuring strict FDR control inevitably incur an \Omega(T) linear regret penalty, as threshold depletion during signal-sparse cold starts forces massive false negatives. Tailored for exogenous testing streams, we propose Decoupled-OMT (DOMT) as a baseline-agnostic meta-wrapper. By incorporating a history-decoupled, strictly non-negative random perturbation, DOMT rescues purely deterministic baselines from severe threshold depletion. Crucially, it preserves exact asymptotic safety in stationary environments and rigorously bounds finite-sample error inflation during cold-starts. Guaranteeing zero additional false negatives, it yields an order-optimal \Omega(\sqrt{T}) regret reduction in bursty environments, with a derived "Cold-Start Tax" characterizing the exact phase transition of algorithmic superiority. Experiments validate that DOMT consistently curtails empirical weighted regret, achieving an order-optimal sublinear mitigation of threshold depletion to navigate the non-stationary Pareto frontier.
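The DOMT recipe, a deterministic alpha-spending baseline plus a strictly non-negative, history-decoupled random perturbation of the thresholds, can be sketched as below; the exponential perturbation is an assumption, since the abstract does not specify the distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def domt_thresholds(T, alpha=0.05, decay=1.6, scale=0.01):
    # Deterministic baseline: a standard normalized alpha-spending sequence.
    t = np.arange(1, T + 1)
    gamma = t ** (-decay)
    gamma /= gamma.sum()
    base = alpha * gamma
    # DOMT-style wrapper: add a strictly non-negative perturbation drawn
    # independently of the testing history (exponential here is a guess;
    # the paper's exact distribution is not given in the abstract).
    return base + rng.exponential(scale * base)

thr = domt_thresholds(1000)
```

The perturbation can only raise thresholds, so no discovery made by the baseline is lost (zero additional false negatives), while depleted late-stream thresholds retain a small chance of rejecting strong signals.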

[AI-166] Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

链接: https://arxiv.org/abs/2605.13915
作者: Lingchao Zheng,Yuwei Fan,Jun Li,Chengqiu Hu,Qichen Liao,Junyi Fan,Rui Shi,Fangzheng Miao
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step (converting low-bit weights back to high precision for matrix multiplication) has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight/KV dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision conversion to multi-scale approximation, avoiding INT8-to-BF16 weight conversion before GEMM. We instantiate MSD for two weight formats and derive tight error bounds for each. For INT8 weights (W4A16), two-pass INT8 decomposition achieves nearly 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields nearly 6.6 effective bits with an error bound of 1/64 per block, surpassing single-pass MXFP8 (5.24 bits) while maintaining the same effective GEMM compute time. We further derive closed-form latency and HBM traffic models showing that MSD avoids the Vector-Cube pipeline stall caused by dequantization and reduces KV cache HBM traffic by up to 2.5 times in attention. Numerical simulations on matrix multiplication and Flash Attention kernels confirm that MSD does not degrade accuracy compared to dequantization baselines, and in many settings achieves lower L2 error.
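The two-pass INT8 activation decomposition can be sketched directly: quantize the activation once, then quantize the residual at a finer scale, so a high-precision product is recovered from two native low-bit passes (an illustrative kernel, not the Ascend implementation):

```python
import numpy as np

def two_pass_int8_decompose(x):
    # First pass: standard symmetric INT8 quantization of the activation.
    s1 = max(np.abs(x).max() / 127.0, 1e-12)
    q1 = np.clip(np.round(x / s1), -127, 127).astype(np.int8)
    # Second pass: quantize the first-pass residual at a finer scale.
    r = x - s1 * q1.astype(np.float32)
    s2 = max(np.abs(r).max() / 127.0, 1e-12)
    q2 = np.clip(np.round(r / s2), -127, 127).astype(np.int8)
    return (s1, q1), (s2, q2)

def reconstruct(p1, p2):
    (s1, q1), (s2, q2) = p1, p2
    return s1 * q1.astype(np.float32) + s2 * q2.astype(np.float32)
```

With INT8 weights `W`, the product becomes `y ≈ s1 * (q1 @ W) + s2 * (q2 @ W)`, so both passes run as native INT8 GEMMs with no dequantization on the GEMM path.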

[AI-167] AIS: Adaptive Importance Sampling for Quantized RL

链接: https://arxiv.org/abs/2605.13907
作者: Jiajun Zhou,Wei Shao,Lingchao Zheng,Yuwei Fan,Ngai Wong
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.
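The interpolation at the heart of AIS, a single mixing coefficient blending the uncorrected and fully importance-weighted gradients, can be sketched as follows (the three per-batch diagnostics that set the coefficient are collapsed into a fixed placeholder `lam` here, and the clipping threshold is an assumption):

```python
import numpy as np

def importance_weights(logp_train, logp_rollout, clip=10.0):
    # Per-sample importance ratios between the BF16 trainer policy and the
    # FP8 rollout policy, clipped for numerical stability.
    return np.clip(np.exp(logp_train - logp_rollout), 1.0 / clip, clip)

def ais_mix(grad_uncorrected, grad_weighted, lam):
    # lam = 0 keeps the raw (exploratory but biased) gradient;
    # lam = 1 applies the full importance-sampling correction.
    return (1.0 - lam) * grad_uncorrected + lam * grad_weighted

# Toy batch: scalar per-sample "gradients" and log-probs under both policies.
rng = np.random.default_rng(1)
g = rng.normal(size=8)
logp_train = rng.normal(size=8) * 0.1
logp_rollout = logp_train + rng.normal(size=8) * 0.05

w = importance_weights(logp_train, logp_rollout)
g_uncorrected = g.mean()
g_weighted = (w * g).mean()
g_ais = ais_mix(g_uncorrected, g_weighted, lam=0.5)
```

In the paper, `lam` would be recomputed each batch from weight reliability, divergence severity, and variance amplification, so the correction strengthens only as the mismatch turns harmful.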

[AI-168] Consciousness as Uncommon Self-Knowledge: A Synergistic Information Framework

链接: https://arxiv.org/abs/2605.13884
作者: Krti Tallam
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注: Conceptual and formal paper on consciousness as uncommon self-knowledge, 8 pages, 2 tables

点击查看摘要

Abstract:We propose uncommon self-knowledge (USK) as a candidate criterion for consciousness: synergistic information a system carries about itself that exists only in the joint of its subsystems and is destroyed by decomposition. Drawing on Gottwald’s partition-lattice grounding of Partial Information Decomposition (PID), where redundancy corresponds to Aumann’s common knowledge and synergy to the gap between separate and joint observation, we propose the synergistic component of self-directed information as a candidate formal signature for conscious processing. If correct, the framework would (1) offer a clean separation between consciousness and metacognition (synergistic vs. redundant self-knowledge), (2) provide principled resolutions to counterexamples that challenge IIT, GWT, and HOT, (3) be operationalizable via Partial Information Rate Decomposition (PIRD) with self-targeting, and (4) generate distinctive empirical predictions, the strongest being a GWT timing dissociation (consciousness correlates with pre-broadcast synergy formation, not broadcast itself) and a specific dissociation between self-report disruption and task-performance disruption under middle-layer perturbation in LLMs. The proposal is consistent with recent empirical findings that both anaesthesia and Alzheimer’s disease specifically reduce synergistic information processing while preserving or increasing redundancy.

机器学习

[LG-0] When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

链接: https://arxiv.org/abs/2605.15183
作者: ML Nissen Gonzalez,Melwina Albuquerque,Laurence Wroe,Jacob Meyer Cohen,Logan Riggs Smith,Thomas Dooms
类目: Machine Learning (cs.LG)
*备注: 22 pages, 8 figures. Code: this https URL

点击查看摘要

Abstract:Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one of empirical approximation.

[LG-1] Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction

链接: https://arxiv.org/abs/2605.15157
作者: Zhuohang Li,Liqun Huang,Wei Xu,Zhengming Zhu,Nie Lin,Xiao Ma,Xinjun Sheng,Ruoshi Wen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human takeover data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the takeover moment, which causes abrupt robot-hand configuration changes, or “gesture jumps”. We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with direct teleoperation takeover, HandITL reduces takeover jitter by 99.8% and preserves robust post-takeover manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect intervention data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.
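One simple way to picture a seamless takeover (an illustrative mechanism, not necessarily HandITL's exact scheme) is a time-ramped blend from the policy's command to the human's, which removes the discontinuity of a hard switch:

```python
import math

def blended_command(policy_cmd, human_cmd, t, tau=10.0):
    # Ramp the blend weight from 0 (pure policy) toward 1 (pure human)
    # over roughly tau control steps after the takeover begins, so the
    # commanded hand configuration never jumps discontinuously.
    w = 1.0 - math.exp(-t / tau)
    return (1.0 - w) * policy_cmd + w * human_cmd

# At the takeover instant the command equals the policy's: no "gesture jump".
at_takeover = blended_command(0.2, 0.9, t=0.0)
# Long after takeover the human's corrective command dominates.
settled = blended_command(0.2, 0.9, t=100.0)
```

A hard switch would jump from 0.2 to 0.9 in one step; the blend traverses the gap smoothly, which is the property the 99.8% jitter reduction quantifies.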

[LG-2] Training ML Models with Predictable Failures

链接: https://arxiv.org/abs/2605.15134
作者: Will Schwarzer,Scott Niekum
类目: Machine Learning (cs.LG)
*备注: 32 pages, 9 figures

点击查看摘要

Abstract:Estimating how often an ML model will fail at deployment scale is central to pre-deployment safety assessment, but a feasible evaluation set is rarely large enough to observe the failures that matter. Jones et al. (2025) address this by extrapolating from the largest k failure scores in an evaluation set to predict deployment-scale failure rates. We give a finite-k decomposition of this estimator’s forecast error and show that it has a built-in bias toward over-prediction in the typical case, which is the safety-favorable direction. This bias is offset when the evaluation set misses a rare high-failure mode that the deployment set contains, leaving the forecast to under-predict at deployment scale. We propose a fine-tuning objective, the forecastability loss, that addresses this failure mode. In two proof-of-concept experiments, a language-model password game and an RL gridworld, fine-tuning substantially reduces held-out forecast error while preserving primary-task capability and achieving safety similar to that of supervised baselines.
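The underlying extrapolation idea, fitting the largest-k evaluation failure scores and reading off the expected extreme at deployment scale, can be sketched under an exponential-tail assumption (a crude stand-in for the Jones et al. estimator; the tail model, k, and set sizes are all illustrative):

```python
import numpy as np

def extrapolate_max_score(scores, k, deploy_n):
    # Fit the largest-k scores against -log(rank / n), i.e. an exponential
    # tail assumption, then read off the expected top score when deploy_n
    # samples are drawn instead of n.
    n = len(scores)
    top = np.sort(scores)[::-1][:k]
    ranks = np.arange(1, k + 1)
    x = -np.log(ranks / n)          # tail quantile coordinate
    b, a = np.polyfit(x, top, 1)    # score ~ a + b * x
    # The deployment-scale maximum sits at effective rank 1 out of deploy_n.
    return a + b * np.log(deploy_n)

rng = np.random.default_rng(0)
# Exponentially-tailed failure scores: 10k eval samples, 1M at deployment.
eval_scores = rng.exponential(scale=1.0, size=10_000)
pred = extrapolate_max_score(eval_scores, k=100, deploy_n=1_000_000)
```

For Exp(1) scores the expected maximum of N draws is about log N, so the forecast should land near log(10^6) ≈ 13.8, well above anything the evaluation set itself contains; the paper's point is that such forecasts are biased toward over-prediction unless a rare high-failure mode is missed entirely.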

[LG-3] Causal Foundation Models with Continuous Treatments

链接: https://arxiv.org/abs/2605.15133
作者: Christopher Stith,Medha Barath,Vahid Balazadeh,Jesse C. Cresswell,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注: 22 pages, 9 figures

点击查看摘要

Abstract:Causal inference, estimating causal effects from observational data, is a fundamental tool in many disciplines. Of particular importance across a variety of domains is the continuous treatment setting, where the variable of intervention has a continuous range. This setting is far less explored and represents a substantial shift from the binary treatment setting, with models needing to represent effects across a continuum of treatment values. In this paper, we present the first causal foundation model for the continuous treatment setting. Our model meta-learns the ability to predict causal effects across a wide variety of unseen tasks without additional training or fine-tuning. First, we design a novel prior over data-generating processes with continuous treatment variables in order to generate a rich causal training corpus. We then train a transformer to reconstruct individual treatment-response curves given only observational data, leveraging in-context learning to amortize expensive Bayesian posterior inference. Our model achieves state-of-the-art performance on individual treatment-response curve reconstruction tasks compared to causal models which are trained specifically for those tasks.

[LG-4] Natural Synthesis: Outperforming Reactive Synthesis Tools with Large Reasoning Models

链接: https://arxiv.org/abs/2605.15131
作者: Frederik Schmitt,Matthias Cosler,Niklas Metzger,Julian Siber,Vladimir Krsmanovic,Mohamed Ghanem,Bernd Finkbeiner
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reactive synthesis, the problem of automatically constructing a hardware circuit from a logical specification, is a long-standing challenge in formal verification. It is elusive for two reasons: It is algorithmically hard, and writing formal specifications by hand is notoriously difficult. In this paper, we tackle both sides of the problem. For the algorithmic side, we present a neuro-symbolic approach to reactive synthesis that couples large reasoning models with model checkers to iteratively repair a synthesized Verilog implementation via sound symbolic feedback. Our approach solves more benchmarks than the best dedicated tools in the annual synthesis competition and extends to constructing parameterized systems, a problem known to be undecidable. On the specification side, we introduce an autoformalization step that shifts the specification task from temporal logic to natural language by introducing a hand-authored dataset of natural-language specifications for evaluation. We demonstrate performance comparable to that of starting from formal specifications, establishing natural synthesis as a viable end-to-end workflow.
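The iterative-repair loop, a proposer refined by sound counterexamples from a checker, has a simple generic skeleton; the proposer and checker below are toy stand-ins (a patch table and an exhaustive functional spec), not a large reasoning model or a real Verilog model checker:

```python
def repair_loop(propose, check, max_iters=20):
    # Neuro-symbolic repair: propose an implementation, model-check it, and
    # feed any counterexample back to the proposer until the check passes.
    impl = propose(None)
    for _ in range(max_iters):
        ok, counterexample = check(impl)
        if ok:
            return impl
        impl = propose(counterexample)
    return None

# Toy "specification": f(x) == x + 1 for all x in 0..9.
def check(f):
    for x in range(10):
        if f(x) != x + 1:
            return False, x   # sound symbolic feedback: a concrete failing input
    return True, None

# Toy "proposer": patches its lookup table for each counterexample it sees.
patches = {}
def propose(cex):
    if cex is not None:
        patches[cex] = cex + 1
    table = dict(patches)
    return lambda x: table.get(x, 0)

f = repair_loop(propose, check)
```

The soundness of the checker is what makes the loop safe: the returned implementation is only ever accepted after passing verification, regardless of how unreliable the proposer is.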

[LG-5] CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic Contact-Rich Scenarios

链接: https://arxiv.org/abs/2605.15122
作者: Michael Baumgartner,David Müller,Agon Serifi,Ruben Grandia,Espen Knoop,Markus Gross,Moritz Bächer
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: RSS 2026

点击查看摘要

Abstract:Robust state estimation for highly dynamic motion of legged robots remains challenging, especially in dynamic, contact-rich scenarios. Traditional approaches often rely on binary contact states that fail to capture the nuances of partial contact or directional slippage. This paper presents CoCo-InEKF, a differentiable invariant extended Kalman filter that utilizes continuous contact velocity covariances instead of binary contact states. These learned covariances allow the method to dynamically modulate contact confidence, accounting for more nuanced conditions ranging from firm contact to directional slippage or no contact. To predict these covariances for a set of predefined contact candidate points, we employ a lightweight neural network trained end-to-end using a state-error loss. This approach eliminates the need for heuristic ground-truth contact labels. In addition, we propose an automated contact candidate selection procedure and demonstrate that our method is insensitive to their exact placement. Experiments on a bipedal robot demonstrate a superior accuracy-efficiency tradeoff for linear velocity estimation, as well as improved filter consistency compared to baseline methods. This enables the robust execution of challenging motions, including dancing and complex ground interactions – both in simulation and in the real world.

[LG-6] Learning from Language Feedback via Variational Policy Distillation

链接: https://arxiv.org/abs/2605.15113
作者: Yang Li,Erik Nijkamp,Semih Yavuz,Shafiq Rayhan Joty
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher’s zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher’s ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.

[LG-7] An Interpretable Latency Model for Speculative Decoding in LLM Serving

链接: https://arxiv.org/abs/2605.15051
作者: Linghao Kong,Megan Flynn,Michael Peng,Nir Shavit,Mark Kurtz,Alexandre Marques
类目: Machine Learning (cs.LG); Performance (cs.PF)
*备注: 10 pages, 8 figures

点击查看摘要

Abstract:Speculative decoding (SD) accelerates large language model (LLM) inference by using a smaller draft model to propose multiple tokens that are verified by a larger target model in parallel. While prior work demonstrates substantial speedups in isolated or fixed-batch settings, the behavior of SD in production serving systems remains poorly understood: request load varies over time, and effective batch size emerges from the serving system rather than being directly controlled or observed. In this work, we develop a simple and interpretable latency model for SD in LLM serving. We infer effective batch size from request rate using Little’s Law and decompose per-request demand into load-independent and load-dependent components for prefill, drafting, and verification. We validate our model using extensive measurements from vLLM across verifier and drafter model sizes, prefill and decode lengths, request rates, draft lengths, and acceptance probabilities. The model accurately describes observed latency, explains why speedups often diminish as server load increases, and characterizes how draft length, acceptance rate, and verifier-drafter size shape latency across serving conditions, with implications for configuring SD in deployed systems. We further show how the framework extends to mixture of experts models, where sparse expert activation changes the effective service costs across load regimes. Together, our results provide a structured framework for understanding SD in real LLM serving systems.
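The backbone of such a model, inferring effective batch size from request rate via Little's Law and feeding it into a load-dependent per-step cost, can be sketched as a fixed-point computation (the constants `a` and `b` are illustrative, not fitted vLLM values):

```python
def effective_batch_size(request_rate, mean_latency):
    # Little's Law: average number of requests in the system L = lambda * W.
    return request_rate * mean_latency

def per_step_latency(batch, a, b):
    # Decompose per-step cost into a load-independent part (a) and a
    # load-dependent part that grows with the effective batch size (b * batch).
    return a + b * batch

def solve_steady_state(request_rate, a, b, iters=100):
    # Fixed point: latency determines batch size, which determines latency.
    # Converges to a / (1 - b * request_rate) when b * request_rate < 1.
    w = a
    for _ in range(iters):
        batch = effective_batch_size(request_rate, w)
        w = per_step_latency(batch, a, b)
    return w

w_low = solve_steady_state(request_rate=1.0, a=0.01, b=0.001)
w_high = solve_steady_state(request_rate=50.0, a=0.01, b=0.001)
```

The closed form makes the paper's observation concrete: as the request rate pushes `b * lambda` toward 1, per-step latency blows up, which is exactly the regime where speculative-decoding speedups diminish.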

[LG-8] Separating Intrinsic Ambiguity from Estimation Uncertainty in Deep Generative Models for Linear Inverse Problems

链接: https://arxiv.org/abs/2605.15050
作者: Yuxin Guo,Dongrui Deng,Pulkit Grover
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recently, deep generative models have been used for posterior inference in inverse problems, including high-stakes applications in medical imaging and scientific discovery, where the uncertainty of a prediction can matter as much as the prediction itself. However, posterior uncertainty is difficult to interpret because it can mix ambiguity inherent to the forward operator with uncertainty propagated through inference. We introduce a structural decomposition of posterior uncertainty that isolates intrinsic ambiguity. A cascade formulation makes this ambiguity accessible for calibration analysis, enabling qualitative diagnostics and simulation-based calibration tests that reveal failure modes that remain hidden when models are selected by reconstruction quality alone. We first validate the approach on a Gaussian example with analytical posterior structure, then illustrate the decomposition on accelerated magnetic resonance imaging (MRI), and finally apply the calibration diagnostics to electroencephalography (EEG) source imaging.

[LG-9] TopoPrimer: The Missing Topological Context in Forecasting Models

链接: https://arxiv.org/abs/2605.15035
作者: Zara Zetlin,Kayhan Moharreri,Maria Safi
类目: Machine Learning (cs.LG)
*备注: 29 pages, 16 figures

点击查看摘要

Abstract:We introduce TopoPrimer, a framework that makes the global topological structure of the series population an explicit input to any forecasting model. TopoPrimer improves accuracy across diverse domains, stabilizes forecasts under seasonal demand spikes, and closes the cold-start gap. Precomputed once per domain via persistent homology and spectral sheaf coordinates, TopoPrimer deploys per token for fully-trained models and as a lightweight adapter for pre-trained backbones. Of these two components, sheaf coordinates are the primary accuracy driver. Across four public benchmarks on Chronos and TimesFM, TopoPrimer consistently improves forecasting accuracy, with gains of up to 7.3% MSE on ECL. The topology advantage persists with near-identical magnitude across zero-shot and fine-tuned backbones, suggesting topology and per-series training capture complementary signals. The gains are most pronounced in difficult regimes. Under peak seasonal demand, classical and zero-shot models degrade by up to 50%, while TopoPrimer stays within 10%. At cold start with no item history, TopoPrimer reduces MAE by 27% over a topology-free baseline. 

[LG-10] DeepTokenEEG: Enhancing Mild Cognitive Impairment and Alzheimer's Classification via Tokenized EEG Features

链接: https://arxiv.org/abs/2605.15009
作者: Thinh Nguyen-Quang,Minh Long Ngo,Ngoc-Son Nguyen,Nguyen Thanh Vinh,Huy-Dung Han,Bui Thanh Tung,Nguyen Quang Linh,Khuong Vo,Manoj Vishwanath,Hung Cao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The detection of Alzheimer's disease (AD) is considered crucial, as timely intervention can improve patient outcomes. Electroencephalogram (EEG)-based diagnosis has been recognized as a non-invasive, accessible, and cost-effective approach for AD detection; however, it faces challenges related to data availability, the accuracy of modern deep learning methods, and the time-consuming nature of expert-based interpretation. In this study, a novel lightweight and high-performance model, DeepTokenEEG, was designed for the diagnosis of AD and the classification of EEG signals from AD patients, individuals with other neurological conditions, and healthy subjects. Unlike traditional heavy-weight models, DeepTokenEEG utilizes a spatial and temporal tokenizer that effectively captures AD-related biomarkers in both the temporal and frequency domains with only 0.29 million parameters. Trained on a combined dataset of 274 subjects, including 180 AD cases and 94 healthy controls, the proposed method achieves a maximum recorded accuracy of 100% on specific frequency bands, representing an improvement of 1.41-15.35% over state-of-the-art methods on the same dataset. These results indicate the potential of DeepTokenEEG for early detection and screening of AD, with promising applicability for deployment due to its compact size.

[LG-11] Distance-Matrix Wasserstein Statistics for Scalable Gromov–Wasserstein Learning

链接: https://arxiv.org/abs/2605.14981
作者: Ao Xu,Tieru Wu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Gromov–Wasserstein (GW) distances compare graphs, shapes, and point clouds through internal distances, without requiring a common coordinate system. This invariance is powerful, but discrete GW is a nonconvex quadratic optimal transport problem and is difficult to estimate at scale. We propose Distance-Matrix Wasserstein (DMW), a hierarchy of Wasserstein statistics comparing laws of random finite distance matrices. Rather than optimizing a global point-level alignment, DMW samples n points from each space, records their pairwise distances, and transports the resulting matrix laws. We prove that DMW is a relaxation and lower bound of GW, and establish a reverse approximation inequality: the GW–DMW gap is controlled by the Wasserstein error of approximating each original measure with n samples. Hence population DMW converges to GW as sampled subspaces become dense. We further give finite-sample bounds, including intrinsic-dimensional rates that depend on the data manifold rather than the ambient matrix dimension n(n-1)/2. For scalable computation, we introduce sliced and multi-scale DMW; for p=1, the sliced multi-scale dissimilarity yields positive-definite exponential kernels. Experiments on synthetic metric spaces, scalability benchmarks, graph classification, and two-sample testing validate the theory and demonstrate an interpretable GW-style proxy for structural comparison.
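A minimal illustration of the DMW recipe, simplified to comparing only the entrywise marginal of the sampled distance matrices with a 1D Wasserstein distance (the full method transports laws of whole matrices; sample counts and sizes below are arbitrary):

```python
import numpy as np

def w1_empirical(a, b):
    # 1D Wasserstein-1 distance between equal-size empirical samples:
    # the average gap between the sorted samples.
    return np.abs(np.sort(a) - np.sort(b)).mean()

def sampled_distance_entries(points, n, trials, rng):
    # Draw n points per trial and collect their pairwise distances: samples
    # from (a 1D flattening of) the law of the random n x n distance matrix.
    out = []
    for _ in range(trials):
        sub = points[rng.choice(len(points), size=n, replace=False)]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        out.append(d[np.triu_indices(n, k=1)])
    return np.concatenate(out)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
y = x @ rot.T          # isometric copy: identical internal distances
z = 2.0 * x            # genuinely different: all distances doubled

dx = sampled_distance_entries(x, n=5, trials=300, rng=rng)
dy = sampled_distance_entries(y, n=5, trials=300, rng=rng)
dz = sampled_distance_entries(z, n=5, trials=300, rng=rng)

gap_iso = w1_empirical(dx, dy)      # small: rotation is invisible
gap_scaled = w1_empirical(dx, dz)   # large: scaling changes internal distances
```

Like GW, the statistic never sees coordinates, only internal distances, so the rotated cloud is indistinguishable from the original while the rescaled one is not.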

[LG-12] InfoSFT: Learn More and Forget Less with Information-Aware Token Weighting

链接: https://arxiv.org/abs/2605.14967
作者: Mahdi Sabbaghi,George Pappas,Adel Javanmard,Hamed Hassani
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Supervised fine-tuning (SFT) provides the standard approach for teaching LLMs new behaviors from offline expert demonstrations. However, standard SFT uniformly fits all samples – including those with low likelihood under the base model – which can disproportionately drive training updates toward overfitting specific samples rather than learning the target behavior. Moreover, adapting to these unlikely samples induces substantial policy shifts that degrade prior capabilities. Existing methods mitigate this by filtering, regenerating, or down-weighting low-likelihood data. In doing so, they often suppress precisely the novel behaviors the base model has yet to learn. We propose InfoSFT, a principled weighting scheme for the SFT objective that concentrates learning signals on maximally informative, medium-confidence tokens – those neither overly familiar to the base model nor too unlikely to cause instability. Requiring only a one-line modification to the standard token-wise loss, InfoSFT demonstrably improves generalization over vanilla SFT and likelihood-weighted baselines across math, code, and chain-of-thought tasks with diverse model families, while better preserving pre-existing capabilities.
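The flavor of such a scheme, a one-line reweighting of the token-wise loss that peaks at medium confidence, can be sketched with a hypothetical bell-shaped weight (the paper's exact weighting function is not reproduced here):

```python
import numpy as np

def info_weight(p):
    # Hypothetical information-aware weight: -p*log(p) vanishes for tokens
    # the base model already predicts confidently (p near 1) and for tokens
    # it finds extremely unlikely (p near 0), peaking at p = 1/e.
    p = np.clip(p, 1e-8, 1.0)
    return -p * np.log(p)

def weighted_sft_loss(token_probs):
    # One-line modification: re-weight the token-wise cross-entropy so the
    # learning signal concentrates on medium-confidence tokens.
    p = np.clip(np.asarray(token_probs), 1e-8, 1.0)
    w = info_weight(p)
    return float(np.sum(w * -np.log(p)) / np.sum(w))

w_mid = info_weight(np.exp(-1.0))   # medium confidence: maximal weight
w_hi = info_weight(0.99)            # already known: near-zero weight
w_lo = info_weight(0.001)           # far too unlikely: near-zero weight
```

The bell shape is the point: confident tokens contribute little gradient (nothing new to learn), and extremely unlikely tokens are damped rather than driving the large policy shifts that erode prior capabilities.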

[LG-13] Efficient Online Conformal Selection with Limited Feedback

链接: https://arxiv.org/abs/2605.14953
作者: Sreenivas Gollapudi,Kostas Kollias,Kamesh Munagala,Ali Sinop
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We address the problem of conformal selection, where an agent must select a minimal subset of options to ensure that at least one "success" is identified with a pre-specified target probability \phi . While traditional online conformal prediction focuses on maintaining validity for the observed sequence, minimizing the resource cost (efficiency) of such selections, especially under limited feedback, remains a significant challenge. In this work, we consider settings with the most limited "bandit" feedback, and demonstrate that the simple Adaptive Conformal Inference (ACI) update rule, when applied to the appropriate control parameter or dual variable, is both adversarially valid, ensuring the success target is met on average for any input sequence (and hence under distribution shifts), and stochastically efficient, achieving sublinear efficiency regret for i.i.d. inputs against an appropriate stochastic benchmark. We show such guarantees under canonical models capturing bandit and semi-bandit feedback to the agent via a unifying algorithmic technique, and analytic framework involving Lyapunov functions. Our approach handles more complex settings than prior work, while requiring significantly less feedback, and our results provide a new theoretical bridge between efficient online learning with limited feedback and distribution-free uncertainty quantification.
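The ACI update rule behind this, and the telescoping argument that gives validity for any sequence, can be sketched in a few lines (the toy environment where the miss probability simply equals the control parameter is an illustrative assumption):

```python
import numpy as np

def aci_step(alpha, err, target, gamma):
    # ACI update: alpha_{t+1} = alpha_t + gamma * (target - err_t).
    # Summing over t, alpha_T - alpha_0 = gamma * (T*target - sum(err)),
    # so a bounded alpha forces the long-run miss rate to the target.
    return alpha + gamma * (target - err)

rng = np.random.default_rng(0)
target, gamma, T = 0.1, 0.05, 5000
alpha, errs = 0.5, []
for _ in range(T):
    # Toy environment: with probability alpha the selected subset contains
    # no success (a "miss"); bandit feedback reveals only this single bit.
    err = float(rng.random() < np.clip(alpha, 0.0, 1.0))
    errs.append(err)
    alpha = aci_step(alpha, err, target, gamma)

long_run_err = float(np.mean(errs))
```

Because the update only needs the one-bit miss indicator, the same mechanism works under the most limited bandit feedback the paper considers.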

[LG-14] A Hardware-Aware Per-Layer Methodology for Post-Training Quantization of Large Language Models

链接: https://arxiv.org/abs/2605.14929
作者: Earl Killian
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注: 21 pages

点击查看摘要

Abstract:Scaled Outer Product (SOP) is a post-training quantization methodology for large language model weights, designed to deliver near-lossless fidelity at 4.5–6 bits per weight on hardware with per-layer LUT decode. The methodology combines per-layer search of fixed and dynamic codebook pairs selected by a per-block selection bit, signed per-block scales, activation-weighted cosine selection, and multiple-choice knapsack promotion of sensitive layers with outlier and sparse-residual correction. Fixed codebooks include NF4, BOF4, Split87, and SH4; per-layer optimized codebooks (DD4) are hosted in LUT SRAM. A new hardware-efficient LUT output format (HIF) is proposed to improve performance, energy, and cost. Across six open model families, the recommended FP6 operating point (E2M3sUE4M4, 6.5 bpw) achieves lower weight reconstruction error than the conventional per-layer-POT FP8 baseline (E4M3, 8.0 bpw) at 1.5 bpw lower storage cost, demonstrating that block-scaled small atoms with carefully chosen scale precision can replace conventionally-deployed FP8. Full evaluation across the 4.5–6 bpw range, including layer promotion and sparse residual correction, is reported in a companion paper.
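The basic per-block LUT decode step, normalizing a weight block by its scale and snapping each entry to a 16-entry codebook, can be sketched as follows (the uniform codebook and block size are illustrative; SOP's fixed/dynamic codebook pairs, per-block selection bits, and signed scales are not modeled):

```python
import numpy as np

def quantize_block(w, codebook):
    # Per-block scaled codebook quantization: normalize by the block's absmax,
    # then snap each weight to the nearest codebook entry (a 4-bit index).
    scale = np.abs(w).max()
    normalized = w / scale
    idx = np.abs(normalized[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale, codebook):
    # LUT decode: look up each 4-bit index and multiply by the block scale.
    return codebook[idx] * scale

# A hypothetical 16-entry codebook (uniform here; NF4/BOF4/DD4 use
# non-uniform grids tuned to the weight distribution).
codebook = np.linspace(-1.0, 1.0, 16)
rng = np.random.default_rng(0)
w = rng.normal(size=64)
idx, scale = quantize_block(w, codebook)
w_hat = dequantize_block(idx, scale, codebook)
max_err = np.abs(w - w_hat).max()
```

Each normalized weight lies within half a codebook step of its entry, so the reconstruction error is bounded by scale/15 for this uniform grid; per-layer codebook search then trades this bound against the weight distribution's shape.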

[LG-15] Learning with Shallow Neural Networks on Cluster-Structured Features

链接: https://arxiv.org/abs/2605.14927
作者: Elisabetta Cornacchia,Laurent Massoulié
类目: Machine Learning (cs.LG)
*备注: 10 pages main body, 2 figures

点击查看摘要

Abstract:The success of deep learning in high-dimensional settings is often attributed to the presence of low-dimensional structure in real-world data. While standard theoretical models typically assume that this structure lies in the target function, projecting unstructured inputs onto a low-dimensional subspace, data such as images, text or genomic sequences exhibit strong spatial correlations within the input space itself. In this paper, we propose a tractable model to study how these correlations affect the sample complexity of learning with gradient descent on shallow neural networks. Specifically, we consider targets that depend on a small number of latent Boolean variables, and input features grouped into clusters and correlated with the latent variables. Under an identifiability assumption, we show that for a layerwise gradient-descent variant, the sample complexity scales with the number of hidden variables and, when the signal-to-noise ratio is sufficiently high, is independent of the input dimension, up to logarithmic terms. We empirically test our theoretical findings on both synthetic and real data.

[LG-16] A Mutual Information Lower Bound for Multimodal Regression Active Learning

链接: https://arxiv.org/abs/2605.14917
作者: Leonardo Ferreira Guilhoto,Akshat Kaushal,Paris Perdikaris
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Active learning for continuous regression has lacked an acquisition function that targets epistemic uncertainty when the predictive distribution is multimodal: variance misses modal disagreement, and information-theoretic targets like BALD are designed for discrete outputs. We introduce a Two-Index framework that makes this separation explicit: one stochastic index selects among competing model hypotheses (epistemic source), while a second governs within-hypothesis randomness (aleatoric source). An entropy decomposition within the framework identifies the mutual information between the output and the epistemic index as a principled acquisition objective, and we prove this quantity vanishes as the model is trained on growing datasets, confirming that it captures exactly the uncertainty data can resolve. Because this mutual information is intractable for continuous outputs, we derive the Mutual Information Lower Bound (MI-LB) acquisition function, a closed-form approximation for Mixture Density Network ensembles. On benchmarks featuring multimodal systems, MI-LB matches or beats every baseline evaluated and is the only method to do so consistently – geometric and Fisher-based baselines compete only when the input space already encodes the multimodality, and collapse otherwise.

[LG-17] TILBench: A Systematic Benchmark for Tabular Imbalanced Learning Across Data Regimes

链接: https://arxiv.org/abs/2605.14915
作者: Ruizhe Liu,Jiaqi Luo
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Imbalanced learning remains a fundamental challenge in tabular data applications. Despite decades of research and numerous proposed algorithms, a systematic empirical understanding of how different imbalanced learning methods behave across diverse data characteristics is still lacking. In particular, it remains unclear how different method families compare in predictive performance, robustness under varying data characteristics, and computational scalability. In this work, we present Tabular Imbalanced Learning Benchmark (TILBench), a large-scale empirical benchmark for tabular imbalanced learning. TILBench evaluates more than 40 representative algorithms across 57 diverse tabular datasets, resulting in over 200000 controlled experiments across a wide range of data characteristics. Our findings show that no single method consistently dominates across all settings; instead, the effectiveness of imbalanced learning methods depends strongly on dataset characteristics and computational constraints. Based on these findings, we provide practical recommendations for selecting appropriate methods in real-world applications.

[LG-18] Text-Dependent Speaker Verification (TdSV) Challenge 2024: Team Naive System Report

链接: https://arxiv.org/abs/2605.14896
作者: Amir Mohammad Rostami,Pourya Jafarzadeh
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a system for the 2024 Text-Dependent Speaker Verification (TdSV) Challenge. The system achieved a Minimum Detection Cost Function (MinDCF) of 0.0461 and an Equal Error Rate (EER) of 1.3%. Our approach focused on adapting existing state-of-the-art neural networks, ResNet-TDNN and NeXt-TDNN, originally trained on the VoxCeleb dataset. This strategy was chosen because of the limited challenge duration and the available resources at the time. In addition, we designed a lightweight and resource-efficient model, EfficientNet-A0, trained specifically on the challenge dataset to improve adaptation and strengthen the ensemble approach. Our system combines advanced neural architectures, extensive data augmentation, and optimised hyperparameters. These components helped achieve strong performance in text-dependent speaker verification. The results also demonstrate the effectiveness of multi-model ensemble learning for both speaker and phrase verification.
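
For readers reproducing the reported numbers, the EER quoted above is the operating point where the false-acceptance and false-rejection rates cross as a decision threshold sweeps over trial scores. A minimal sketch (our own helper, not the team's evaluation code):

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """EER: sweep every observed score as a threshold and return the point
    where the false-acceptance and false-rejection rates cross."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([tgt, imp])):
        far = np.mean(imp >= t)          # impostors wrongly accepted
        frr = np.mean(tgt < t)           # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```

Perfectly separable score sets give an EER of 0, while fully overlapping ones give 0.5.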

[LG-19] PROCESS-2: A Benchmark Speech Corpus for Early Cognitive Impairment Detection

链接: https://arxiv.org/abs/2605.14888
作者: Madhurananda Pahar,Caitlin H. Illingworth,Bahman Mirheidari,Hend Elghazaly,Fritz Peters,Sophie Young,Wing-Zin Leung,Labhpreet Kaur,Daniel Blackburn,Heidi Christensen
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Speech-based analysis offers a scalable and non-invasive approach for detecting cognitive decline, yet progress has been constrained by the limited availability of clinically validated datasets collected under realistic conditions. We introduce PROCESS-2, a large-scale speech dataset designed to support research on automatic assessment of cognitive impairment from spontaneous and task-oriented speech. The dataset comprises recordings from 200 healthy controls, 150 mild cognitive impairment, and 50 dementia diagnoses collected using the CognoMemory digital assessment platform. Each participant completed a single assessment session, including picture description and verbal fluency tasks, accompanied by manually verified transcripts and participant-level metadata. PROCESS-2 contains approximately 21 hours of speech audio with predefined train/test partitions. Comprehensive technical validation evaluated demographic balance, clinical consistency, recording stability, embedding-space structure, and reproducible baseline modelling performance, demonstrating clinically meaningful group separation and stable performance across modelling approaches while preserving real-world conversational variability. PROCESS-2 is released under controlled access via Hugging Face to enable responsible reuse while protecting participant privacy, providing a reproducible benchmark resource for speech-based cognitive assessment research.

[LG-20] AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks

链接: https://arxiv.org/abs/2605.14884
作者: Magdalena Proszewska,N. Siddharth
类目: Machine Learning (cs.LG)
*备注: 19 pages, 4 figures, 8 tables

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have advanced significantly in handling graph-structured data, but a comprehensive framework for evaluating explainability remains lacking. Existing evaluation frameworks primarily involve post-hoc explanations, and operate in the setting where multiple methods generate a suite of explanations for a single model. This makes comparison of explanations across models difficult. Evaluation of inherently interpretable models often targets a specific aspect of interpretability relevant to the model, but remains underdeveloped in terms of generating insight across a suite of measures. We introduce AIM, a comprehensive framework that addresses these limitations by measuring Accuracy, Instance-level explanations, and Model-level explanations. AIM is formulated with minimal constraints to enhance flexibility and facilitate broad applicability. Here, we use AIM in a pipeline, extracting explanations from inherently interpretable GNNs such as graph kernel networks (GKNs) and prototype networks (PNs), evaluating these explanations with AIM, identifying their limitations and obtaining insights to their characteristics. Taking GKNs as a case study, we show how the insights obtained from AIM can be used to develop an updated model, xGKN, that maintains high accuracy while demonstrating improved explainability. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, providing more robust and practical solutions for understanding and improving complex models.

[LG-21] Fast Adversarial Attacks with Gradient Prediction

链接: https://arxiv.org/abs/2605.14868
作者: Kamil Ciosek,Aleksandr V. Petrov,Nicolò Felicioni,Konstantina Palla
类目: Machine Learning (cs.LG)
*备注: 17 pages

点击查看摘要

Abstract:Generating adversarial examples at scale is a core primitive for robustness evaluation, adversarial training, and red-teaming, yet even “fast” attacks such as FGSM remain throughput-limited by the cost of a backward pass. We introduce a family of attacks that eliminates the backward pass by predicting the input gradient from forward-pass hidden states via a lightweight linear regression. The approach is motivated by a kernel view of neural networks and is exact in the Neural Tangent Kernel regime, while remaining effective for practical finite-width models. Empirically, our methods recover much of FGSM’s attack performance while using only a small fraction of the time, corresponding to a 532% increase in throughput. These results suggest gradient prediction as a simple and general route to significantly faster adversarial generation under realistic wall-clock constraints.
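
The core primitive described here (fit a cheap linear map from forward-pass hidden states to input gradients, then take an FGSM-style sign step with no backward pass) can be sketched on a toy linear "network" where the regression is exact. All names below are our own illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, n_cal = 8, 16, 200

# Toy stand-in "network": hidden states h = A x, loss L(x) = 0.5 * ||A x||^2,
# so the true input gradient A^T (A x) is an exactly linear function of h.
A = rng.normal(size=(d_hid, d_in))

def forward(x):                      # forward pass only: hidden states
    return x @ A.T

def true_input_grad(x):              # what a backward pass would compute
    return forward(x) @ A

# 1) Calibration: record (hidden state, input gradient) pairs once.
X_cal = rng.normal(size=(n_cal, d_in))
H_cal, G_cal = forward(X_cal), true_input_grad(X_cal)

# 2) Fit a linear map hidden -> gradient by least squares.
W_hat, *_ = np.linalg.lstsq(H_cal, G_cal, rcond=None)

# 3) Craft FGSM-style perturbations from forward passes alone.
def fgsm_predicted(x, eps=0.1):
    g_hat = forward(x) @ W_hat       # predicted input gradient, no backprop
    return x + eps * np.sign(g_hat)
```

For a real network the hidden-to-gradient map is only approximately linear, which is where the paper's NTK-regime analysis comes in.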

[LG-22] In-Context Learning for Data-Driven Censored Inventory Control

链接: https://arxiv.org/abs/2605.14840
作者: Sohom Mukherjee,Anh-Duy Pham,Richard Pibernik,Yunbei Xu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study inventory control with decision-dependent censoring, focusing on the censored or repeated newsvendor (R-NV), where each order quantity determines whether demand is fully observed or censored by sales. Existing approaches based on parametric Thompson sampling (TS) can be brittle under prior mismatch, while offline imputation methods need not transfer to online learning. Motivated by the predictive view of decision making, we combine these ideas by taking oracle actions on learned completions of latent demand. We propose in-context generative posterior sampling (ICGPS), which uses modern generative models that are meta-trained offline and deployed online by in-context autoregressive generation. Theoretically, we show that the Bayesian regret of ICGPS with a learned completion kernel is bounded by the Bayesian regret of a TS benchmark with the ideal completion kernel plus a deployment penalty scaling as \sqrt{T} times the square root of the completion mismatch. This yields a plug-in template for operational problems with known TS regret bounds. For R-NV, we derive sublinear Bayesian regret by reducing censored feedback to bandit convex optimization feedback. We also show that, under reasonable coverage and stability assumptions, the online completion mismatch is controlled by the offline censored predictive mismatch, so offline predictive quality transfers to online performance. Practically, we instantiate ICGPS with ChronosFlow, which combines a frozen time-series transformer backbone with a trainable conditional normalizing-flow head for fast censoring-consistent sampling. In benchmark experiments, ChronosFlow-ICGPS matches correctly specified TS, outperforms myopic and UCB-style baselines, and is robust to prior mismatch and distribution shift. ChronosFlow-ICGPS also performs well for the real-world SuperStore dataset, especially under heavy censoring.
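
For context, the two mechanics the abstract combines are decision-dependent censoring (the retailer only observes sales = min(demand, order)) and the oracle newsvendor action, the critical-fractile quantile of completed demand. A toy sketch of both (our own illustration, not ICGPS itself):

```python
import numpy as np

def observe(demand, order):
    """Decision-dependent censoring: the retailer sees sales and a flag,
    never the demand itself when it exceeds the order quantity."""
    sales = np.minimum(demand, order)
    censored = demand > order            # True -> only know demand >= order
    return sales, censored

def newsvendor_order(demand_samples, price=4.0, cost=1.0):
    """Oracle newsvendor action on completed demand: order the
    critical-fractile quantile q = (price - cost) / price."""
    return np.quantile(demand_samples, (price - cost) / price)
```

ICGPS's idea, per the abstract, is to apply the oracle action to generative completions of the censored history rather than to the raw censored observations.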

[LG-23] GenAI for Energy-Efficient and Interference-Aware Compressed Sensing of GNSS Signals on a Google Edge TPU

链接: https://arxiv.org/abs/2605.14839
作者: Thorben Wegner,Lucas Heublein,Tobias Feigl,Felix Ott,Christopher Mutschler,Alexander Rügamer
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 12 pages

点击查看摘要

Abstract:Traditional methods for classifying global navigation satellite system (GNSS) jamming signals typically involve post-processing raw or spectral data streams, requiring complex and costly data transmission to cloud-based interference classification systems. In contrast, our proposed approach efficiently compresses GNSS data streams directly at the hardware receiver while simultaneously classifying jamming and spoofing attacks in real time. Given the growing prevalence of GNSS jamming, there is a critical need for real-time solutions suitable for power-constrained environments. This paper introduces a novel method for compressing and classifying GNSS jamming threats using generative artificial intelligence (GenAI), specifically variational autoencoders (VAEs), deployed on Google Edge tensor processing units (TPUs). The study evaluates various autoencoder (AE) architectures to compress and reconstruct GNSS signals, focusing on preserving interference characteristics while minimizing data size near the receiver hardware. The pipeline adapts large-scale AE models for Google Edge TPUs through 8-bit quantization to ensure energy-efficient deployment. Tests on raw in-phase and quadrature-phase (IQ) data, Fast Fourier Transform (FFT) data, and handcrafted features show the system achieves significant compression (42x) and accurate classification of approximately 72 interference types on reconstructed signals (F2-score 0.915), closely matching the original signals (F2-score 0.923). The hardware-centric GenAI approach also substantially reduces jammer signal transmission costs, offering a practical solution for interference mitigation. Ablation studies on conditional and factorized VAEs (i.e., FactorVAE) explore latent feature disentanglement for data generation, enhancing model interpretability and fostering trust in machine learning (ML) solutions for sensitive interference applications.
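
The Edge-TPU deployment step relies on 8-bit quantization; as a generic illustration of what that entails (not the authors' deployment pipeline), an affine int8 quantize/dequantize round trip looks like:

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) int8 quantization: x is approximated by
    scale * (q - zero_point) with q in [-128, 127]."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = -128 - int(round(lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)
```

The reconstruction error per element is bounded by half the quantization step, which is the trade made for the 4x smaller int8 representation.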

[LG-24] ToMAToMP: Robust and Multi-Parameter Topological Clustering

链接: https://arxiv.org/abs/2605.14824
作者: Ludo Andrianirina,Mathieu Carrière
类目: Machine Learning (cs.LG); Algebraic Topology (math.AT)
*备注:

点击查看摘要

Abstract:Topological clustering, and its main algorithm ToMATo, is a clustering method from Topological Data Analysis (TDA) which has been applied successfully in several applications during the last few years. This is due to its high versatility, as clusters are detected from the persistent components in the sublevel sets of any user-defined function (gene expression, pixel values, etc), and efficiency, as topological clustering enjoys robustness guarantees. However, ToMATo is also limited in several ways. First, a graph on the data points needs to be provided as a hyper-parameter of the method (whose fine-tuning is left to the user). Second, ToMATo is known to be very sensitive to outlier values in the function range. Finally, and most importantly, ToMATo can only handle one function at a time, whereas it is critical to use several functions in various applications. In this article, we introduce ToMAToMP: the first topological clustering method able to handle several functions at the same time with theoretical guarantees. More specifically, we leverage a recent tool from multi-parameter persistent homology, called MMA decomposition, to design our clustering algorithm, and prove that it enjoys robustness properties. As corollaries, we show that it can be used to make ToMATo independent of graph tuning, and robust to outliers. Finally, we provide a set of numerical experiments showcasing the efficiency and quality of the clusterings produced by ToMAToMP, by showing strong improvement over non-topological and topological baselines for various datasets.
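
The single-function ToMATo step that ToMAToMP generalizes works as follows: process points by decreasing density, attach each to the basin of its densest already-processed neighbour, and merge basins whose lower peak has persistence below a threshold tau. A compact union-find sketch on an explicit neighbour graph (our own simplification, not the paper's code):

```python
import numpy as np

def tomato_sketch(density, neighbors, tau):
    """Single-function ToMATo-style clustering sketch.
    density[i]: function value at vertex i; neighbors[i]: adjacent vertices.
    Returns the cluster root id of every vertex."""
    n = len(density)
    parent = list(range(n))              # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    processed = [False] * n
    for v in sorted(range(n), key=lambda i: -density[i]):
        done = [u for u in neighbors[v] if processed[u]]
        if done:
            # attach v to the basin of its densest processed neighbour
            rb = find(max(done, key=lambda u: density[u]))
            parent[v] = rb
            for u in done:
                ru = find(u)
                if ru == rb:
                    continue
                # roots are peaks: merge the lower peak when its persistence
                # (peak height above the current saddle value) is below tau
                low, high = (ru, rb) if density[ru] < density[rb] else (rb, ru)
                if density[low] - density[v] < tau:
                    parent[low] = high
                    rb = high
        processed[v] = True
    return np.array([find(i) for i in range(n)])
```

On a 1-D chain with peaks of density 3 and 5 separated by a saddle of density 1, tau = 1 keeps the two basins separate while tau = 3 merges them into one cluster.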

[LG-25] GFMate: Empowering Graph Foundation Models with Test-time Prompt Tuning

链接: https://arxiv.org/abs/2605.14809
作者: Yan Jiang,Ruihong Qiu,Zi Huang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Graph prompt tuning has shown great potential in graph learning by introducing trainable prompts to enhance the model performance in conventional single-domain scenarios. Recent research has extended graph prompts to improve Graph Foundation Models (GFMs) by few-shot tuning auxiliary prompts. Despite their progress, most existing methods embed source-domain information into prompts, which serve either as input to GFMs or encoded during model pre-training. Such prompt entanglement with specific source domains and GFM pre-training strategy restricts their generalisability to other domains and different GFMs. Furthermore, existing GFM prompts merely rely on few-shot tuning for adaptation, neglecting the rich information in unlabelled target domain test data. Motivated by these insights, this paper aims to empower GFMs with pre-training-agnostic test-time graph prompt tuning, named GFMate. GFMate introduces centroid and layer prompts applied after pre-training on target domains, avoiding entanglement with specific source domains and model pre-training. In addition, a test-time complementary learning objective is devised to exploit both labelled and unlabelled target domain data for effective test-time prompt tuning. Extensive experiments on 12 benchmark datasets demonstrate the superior performance and efficiency of GFMate, achieving improvements of up to 30.63%. Code is available at this https URL.

[LG-26] Peng's Q(λ) for Conservative Value Estimation in Offline Reinforcement Learning ICLR2026

链接: https://arxiv.org/abs/2605.14779
作者: Byeongchan Kim,Min-hwan Oh
类目: Machine Learning (cs.LG)
*备注: Accepted in ICLR 2026

点击查看摘要

Abstract:We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng’s Q(\lambda) (CPQL). Our algorithm adapts the Peng’s Q(\lambda) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees – a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. In addition to the contributions of CPQL in offline RL, our proposed method also contributes to the offline-to-online learning framework. Using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at this https URL.
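
As background on the operator CPQL adapts, Peng's Q(\lambda) targets for a finite trajectory follow the backward recursion G_t = r_t + \gamma[(1-\lambda) max_a Q(s_{t+1}, a) + \lambda G_{t+1}]. A minimal sketch (our own illustration, not the paper's conservative variant):

```python
import numpy as np

def pengs_q_lambda_returns(rewards, next_max_q, gamma, lam):
    """Backward recursion for Peng's Q(lambda) targets on one trajectory:
        G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1})
    next_max_q[t] holds max_a Q(s_{t+1}, a); pass 0 at a terminal state."""
    T = len(rewards)
    G = np.zeros(T)
    G[-1] = rewards[-1] + gamma * next_max_q[-1]      # bootstrap last step
    for t in range(T - 2, -1, -1):
        G[t] = rewards[t] + gamma * ((1 - lam) * next_max_q[t] + lam * G[t + 1])
    return G
```

Setting lam = 0 recovers one-step greedy backups, while lam = 1 gives the Monte Carlo return with a final bootstrap, which is how the operator leverages full offline trajectories.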

[LG-27] Composable Crystals: Controllable Materials Discovery via Concept Learning

链接: https://arxiv.org/abs/2605.14769
作者: Nian Liu,Yuwei Zeng,Ryoji Kubo,Nikita Kazeev,Stephen Gregory Dale,Artem Maevskiy,Pengru Huang,Thomas Laurent,Kostya S. Novoselov,Xavier Bresson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:De novo crystal generation, a central task in materials discovery, aims to generate crystals that are simultaneously valid, stable, unique, and novel. Existing methods mainly rely on black-box stochastic sampling, providing limited control over how generated structures move beyond the observed distribution. In this paper, we introduce a concept-based compositional framework for crystal generation. We train a vector-quantized variational autoencoder to automatically discover a shared set of reusable crystal concepts, which serve as building blocks for guided generation. These learned concepts naturally exhibit interpretability from both local atomic environments and global symmetry patterns, and generalize to crystals from different distributions. By recombining such concepts, our framework enables controllable exploration of novel crystals beyond the training distribution, rather than relying solely on unconstrained random sampling. To further improve composition efficiency, we introduce a composition generator and iteratively refine it using high-quality samples generated by the model itself. The resulting concept compositions are then used to condition downstream crystal generation. Numerical experiments on MP-20 and Alex-MP-20 show that compositing concepts separately increase base model up to 53.2% and 51.7% on V.S.U.N metric, with particular gains in novelty.

[LG-28] Crys-JEPA: Accelerating Crystal Discovery via Embedding Screening and Generative Refinement

链接: https://arxiv.org/abs/2605.14759
作者: Nian Liu,Nikita Kazeev,Stephen Gregory Dale,Artem Maevskiy,Yuwei Zeng,Ryoji Kubo,Pengru Huang,Thomas Laurent,Yann LeCun,Kostya S. Novoselov,Xavier Bresson
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:De novo crystal generation seeks to discover materials that are not merely realistic, but also stable and novel. However, most existing generative models are trained to maximize the likelihood of observed crystals, which encourages samples to stay close to known materials yet not necessarily align with the criteria that matter in discovery. Through an empirical investigation, we show that current crystal generative models are caught in a pronounced stability–novelty trade-off: moving toward the observed distribution preserves stability but limits novelty, whereas moving away from it quickly destroys stability. This suggests that the useful region for discovering crystals that are both stable and novel is extremely narrow. To escape the trade-off, we introduce Crys-JEPA, a joint embedding predictive architecture for crystals that learns an energy-aware latent space preserving formation-energy differences. In this space, stability assessment can be reformulated as an embedding-based comparison against accessible training crystals, reducing the reliance on expensive energy evaluation and task-specific external references. Building on Crys-JEPA, we further develop a screening-and-refinement pipeline that identifies promising generated crystals and reintroduces them to refine the generative model. On MP-20 and Alex-MP-20 datasets, we achieve improvements over baselines up to 81.4% and 82.6% on V.S.U.N metric, respectively.

[LG-29] Selective Safety Steering via Value-Filtered Decoding

链接: https://arxiv.org/abs/2605.14746
作者: Bat-Sheva Einbinder,Hen Davidov,Yee Whye Teh,Yarin Gal,Yaniv Romano
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:While large language models (LLMs) are trained to align with human values, their generations may still violate safety constraints. A growing line of work addresses this problem by modifying the model’s sampling policy at decoding time using a safety reward. However, existing decoding-time steering methods often intervene unnecessarily, modifying generations that would have been safe under the base model. Such unnecessary interventions are undesirable, as they can distort key properties of the base model such as helpfulness, fluency, style, and coherence. We propose a new test-time steering method designed to reduce such unnecessary interventions while improving the safety of unsafe responses. Our approach filters tokens using a value-based safety criterion and provides an explicit bound on the probability of false interventions. A single threshold hyperparameter controls this bound, allowing practitioners to trade off higher rates of unnecessary intervention for better output safety. Across multiple datasets and experiments, we show that our value-filtered decoding method outperforms existing baselines, achieving better trade-offs between safety, helpfulness, and similarity to the base model.
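
The filtering step as described (mask candidate tokens whose estimated safety value falls below the threshold hyperparameter, then renormalize the base model's distribution) can be sketched as follows; the fallback behaviour and all names are our own illustration, not the paper's implementation:

```python
import numpy as np

def value_filtered_probs(base_logits, safety_values, threshold):
    """Mask candidate tokens whose estimated safety value is below
    `threshold`, then renormalize the base model's next-token distribution.
    Falls back to the unfiltered distribution if every token is masked."""
    probs = np.exp(base_logits - np.max(base_logits))
    probs /= probs.sum()
    keep = np.asarray(safety_values) >= threshold
    if not keep.any():
        return probs                  # no safe token: leave base model alone
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()
```

Because surviving tokens keep their relative base-model probabilities, generations that were already safe are left essentially untouched, which is the low-intervention property the abstract emphasizes.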

[LG-30] IsoNet: Spatially-aware audio-visual target speech extraction in complex acoustic environments

链接: https://arxiv.org/abs/2605.14736
作者: Dinanath Pathya,Sajen Maharjan,Binita Adhikari,Ishwor Raj Pokharel
类目: Sound (cs.SD); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:Target speech extraction remains difficult for compact devices because monaural neural models lack spatial evidence and classical beamformers lose resolving power when the microphone aperture is only a few centimetres. We present IsoNet, a user-selectable audio-visual target speech extraction system for a compact 4-microphone array. IsoNet combines complex multi-channel STFT features, GCC-PHAT spatial cues, face-conditioned visual embeddings, and auxiliary direction-of-arrival supervision inside a U-Net mask estimation network. Three curriculum variants were trained on 25,000 simulated VoxCeleb mixtures with progressively difficult SNR regimes. On a hard test set spanning -1 to 10 dB SNR, IsoNet-CL1 achieves 9.31 dB SI-SDR, a 4.85 dB improvement over the mixture, with PESQ 2.13 and STOI 0.84. Oracle delay-and-sum and MVDR beamformers degrade the same mixtures by 4.82 dB and 6.08 dB SI-SDRi, respectively, showing that the proposed learned multimodal conditioning solves a regime where conventional spatial filtering is ineffective. Ablation studies show consistent gains from visual conditioning, GCC-PHAT features, and extended delay-bin encoding. The results establish a compact-array, face-selectable speech extraction baseline under controlled simulation and identify the remaining barriers to real deployment, especially phase reconstruction, multi-interferer mixtures, and simulation-to-real transfer.
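
One of the spatial cues used here, GCC-PHAT, estimates the inter-microphone time delay by whitening the cross-spectrum so that only phase survives; the peak of its inverse FFT then gives the lag. A standard sketch (our own helper, not IsoNet code):

```python
import numpy as np

def gcc_phat_delay(sig, ref):
    """Estimate the integer-sample delay of `sig` relative to `ref` with
    GCC-PHAT: whiten the cross-spectrum so only phase remains, then take
    the peak of its inverse FFT."""
    n = len(sig) + len(ref)                 # zero-pad to avoid wrap-around
    cross = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cross /= np.abs(cross) + 1e-12          # PHAT weighting
    cc = np.fft.irfft(cross, n)
    shift = int(np.argmax(np.abs(cc)))
    return shift - n if shift > n // 2 else shift
```

On a few-centimetre aperture these delays span only a handful of samples, which is exactly why the paper feeds GCC-PHAT features to a learned model instead of relying on classical beamforming.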

[LG-31] The Rate-Distortion-Polysemanticity Tradeoff in SAEs

链接: https://arxiv.org/abs/2605.14694
作者: Tommaso Mencattini,Francesco Montagna,Francesco Locatello
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Sparse Autoencoders (SAEs) that can accurately reconstruct their input (minimizing distortion) by making efficient use of few features (minimizing the rate) often fail to learn monosemantic representations (highly interpretable), limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations, introducing the Rate-Distortion-Polysemanticity tradeoff in SAEs. Under toy-modeling assumptions, we theoretically and empirically show that restricting the SAE to be monosemantic necessarily comes with an increase in rate and distortion. Assuming a generative model behind the input observations, we further demonstrate that the degree of polysemanticity of optimal SAEs is determined by the training data distribution, especially by the probability of features to co-occur. Finally, we extend the analysis to real-world settings by deriving necessary conditions that a polysemanticity measure should satisfy when the data-generating process is unknown, and we benchmark existing proxy metrics on SAEs trained on Large Language Models. Taken together, our findings show that polysemanticity is a data problem that should be accounted for when addressing it at the architectural and optimization level.

[LG-32] ReMIA: a Powerful and Efficient Alternative to Membership Inference Attacks against Synthetic Data Generators

链接: https://arxiv.org/abs/2605.14686
作者: Davide Scassola,Andrea Coser,Sebastiano Saccani
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Tabular data sharing under privacy constraints is increasingly important for research and collaboration. Synthetic data generators (SDGs) are a promising solution, but synthetic data remains vulnerable to attacks, such as membership inference attacks (MIAs), which aim to determine whether a specific record was part of the training data. State-of-the-art MIAs are powerful but impractical: they rely on shadow modeling, requiring hundreds of SDG training runs, and need auxiliary data several times larger than the original training set. Fast proxy metrics like distance to closest record (DCR) are efficient but have limited sensitivity to MIA risk. We introduce ReMIA (Relative Membership Inference Attack), a practical privacy metric that requires only two SDG training runs and additional data no larger than the original training set. Rather than predicting whether a record was in the training set, ReMIA generates two synthetic datasets from two source datasets and measures whether a classifier can identify which source a record came from. Experiments across multiple tabular datasets and SDGs show that ReMIA has a sensitivity comparable to state-of-the-art MIAs while being substantially more practical. We further observe that SDGs can achieve privacy-utility trade-offs that traditional noise-based anonymization methods do not match. Code is available at this https URL.
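
For reference, the distance-to-closest-record (DCR) proxy that the paper contrasts with ReMIA is simply the nearest-neighbour distance of each synthetic row to the training set (plain-NumPy sketch, our own illustration):

```python
import numpy as np

def dcr(synthetic, training):
    """Distance to closest record: Euclidean distance from each synthetic
    row to its nearest training row. Values near zero flag near-copies."""
    diff = synthetic[:, None, :] - training[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
```

Low DCR flags near-copies of training records, but per the abstract it has limited sensitivity to the subtler leakage that membership inference attacks exploit.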

[LG-33] AQKA: Active Quantum Kernel Acquisition Under a Shot Budget

链接: https://arxiv.org/abs/2605.14672
作者: Jian Xu,Chao Li,Delu Zeng,John Paisley,Qibin Zhao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Estimating an N \times N quantum kernel from circuit fidelities requires \Theta(N^2 S) measurement shots, the dominant bottleneck for deployment on near-term hardware. Existing budget-saving methods (Nyström-QKE, ShoFaR, kernel-target alignment) sub-sample which entries to measure but allocate shots uniformly within their chosen subset, ignoring how much each entry drives the downstream classifier. We close this gap with two contributions. First, a complete regime decomposition for shot-budgeted quantum kernel learning: a principled menu of when each allocator wins. Our method, AQKA, dominates the budget-limited regime ( B \lesssim 16 n_{\mathrm{pairs}} ) on sparse-sensitivity KRR, with the gap growing from +8 to +25 pts over uniform as N scales 225 \to 1000 and reaching +26–32 pts on an ibm_pittsburgh (156-qubit Heron) hardware kernel; Nyström-QKE wins at saturating budgets on planted-sparse via low-rank reconstruction; ShoFaR is competitive only at extreme low budgets. Second, a closed-form pair-level acquisition theory: s_{ij}^\star \propto |g_{ij}|\sqrt{K_{ij}(1-K_{ij})} with explicit gradient g_{ij} for KRR (Lemma 1, |\beta_i\alpha_j+\beta_j\alpha_i|\sqrt{K_{ij}(1-K_{ij})} ) and SVM via the envelope theorem ( |\eta_i^*\eta_j^*|\sqrt{K_{ij}(1-K_{ij})} ); a corrected sparsity-aware Cauchy–Schwarz rate \rho \le 2m/N matching empirics (vs. the naive m^2/N^2 ); an explicit-constant plug-in regret bound (Theorem 2); and a tighter SVM ceiling \rho^{\mathrm{SVM}} \le m_{\mathrm{sv}}^2/N^2 . We close with the first multi-seed live online adaptive shot allocation on quantum hardware: +17.0 \pm 4.8 pts at N=20 on ibm_aachen ( 3.5\sigma , 5 seeds), with the advantage holding at N=30 at higher budget on ibm_berlin ( +14.0 \pm 8.5 pts, 5 seeds).
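
The closed-form acquisition rule s_ij ∝ |g_ij| sqrt(K_ij(1-K_ij)) converts directly into an integer shot allocation over kernel entries. A sketch under the paper's stated form, where the rounding scheme is our own choice:

```python
import numpy as np

def allocate_shots(grad, kernel_est, budget):
    """Integer shot allocation proportional to |g_ij| * sqrt(K_ij (1 - K_ij)):
    entries that matter more to the downstream loss, and whose fidelity
    estimate is noisier, receive more of the measurement budget."""
    score = np.abs(grad) * np.sqrt(kernel_est * (1.0 - kernel_est))
    if score.sum() == 0.0:
        score = np.ones_like(score)          # degenerate case: go uniform
    frac = budget * score / score.sum()
    shots = np.floor(frac).astype(int)
    # hand leftover shots to the largest fractional remainders
    leftover = int(budget - shots.sum())
    for i in np.argsort((frac - shots).ravel())[::-1][:leftover]:
        shots.ravel()[i] += 1
    return shots
```

Entries with zero gradient sensitivity, or with fidelity estimates near 0 or 1 (low binomial variance), receive no shots at all, which is how the method departs from uniform allocation.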

[LG-34] Slower Generalization, Faster Memorization: A Sweet Spot in Algorithmic Learning

链接: https://arxiv.org/abs/2605.14659
作者: Shin So,Kyelim Lee,Albert No
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Critical-data-size accounts of grokking suggest a natural post-threshold intuition: once training data is sufficient to identify the underlying rule, additional data should accelerate validation convergence. We show that this intuition can fail in a controlled structured-output task. In Needleman–Wunsch (NW) matrix generation, small Transformers reach high validation exact-match accuracy fastest at an intermediate dataset size, not at the largest one. Past this dataset-size sweet spot, generalization remains achievable but requires more gradient updates. Conversely, in the regime where partial validation competence first appears, larger datasets can require fewer updates to reach high training accuracy, suggesting that emerging rule structure can accelerate fitting beyond example-wise memorization. A multiplication baseline does not show the same post-threshold slowdown. These results separate the critical data size for the onset of generalization from the dataset size that optimizes update-based convergence, and identify structured-output tasks where learning the rule and completing exact-fitting can diverge.
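
The structured output being generated here is the full Needleman–Wunsch dynamic-programming matrix; for reference, the standard textbook recurrence (with illustrative match/mismatch/gap scores of our choosing) is:

```python
import numpy as np

def nw_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score matrix F, where F[i][j] is the best global
    alignment score of a[:i] against b[:j]."""
    F = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    F[:, 0] = gap * np.arange(len(a) + 1)
    F[0, :] = gap * np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i, j] = max(F[i - 1, j - 1] + sub,   # align a[i-1] with b[j-1]
                          F[i - 1, j] + gap,       # gap in b
                          F[i, j - 1] + gap)       # gap in a
    return F
```

Every cell depends on three neighbours, so exact-match accuracy on the whole matrix is a far stricter target than getting the final score right, which is what makes this a useful structured-output probe.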

[LG-35] Unbiased and Second-Order-Free Training for High-Dimensional PDEs ICML2026

链接: https://arxiv.org/abs/2605.14643
作者: Jaemin Seo,Surin Lee,Jae Yong Lee
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
*备注: Accepted at ICML 2026

点击查看摘要

Abstract:Deep learning methods based on backward stochastic differential equations (BSDEs) have emerged as competitive alternatives to physics-informed neural networks (PINNs) for solving high-dimensional partial differential equations (PDEs). By leveraging probabilistic representations, BSDE approaches can avoid the curse of dimensionality and often admit second-order-free training objectives that do not require explicit Hessian evaluations. It has recently been established that the commonly used Euler-Maruyama (EM) time discretization induces an intrinsic bias in BSDE training losses. While high-order schemes such as Heun can fully eliminate this bias, such schemes re-introduce second-order spatial derivatives and incur substantial computational overhead. In this work, we provide a principled analysis of EM-induced loss bias and propose an unbiased, second-order-free training framework that preserves the computational advantages of BSDE methods. Our code is available at this https URL.
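
For context, the Euler–Maruyama scheme whose induced loss bias the paper analyzes discretizes dX_t = b(X_t) dt + \sigma(X_t) dW_t as X_{k+1} = X_k + b(X_k) \Delta t + \sigma(X_k) \Delta W_k. A generic path simulator (our own sketch, not the paper's BSDE solver):

```python
import numpy as np

def euler_maruyama(x0, drift, diffusion, T, n_steps, n_paths, seed=0):
    """Simulate dX = drift(X) dt + diffusion(X) dW with the Euler-Maruyama
    scheme. Returns an (n_paths, n_steps + 1) array of sample paths."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X = np.empty((n_paths, n_steps + 1))
    X[:, 0] = x0
    for k in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=n_paths)
        X[:, k + 1] = X[:, k] + drift(X[:, k]) * dt + diffusion(X[:, k]) * dW
    return X
```

With zero diffusion the scheme reduces to forward Euler, an easy sanity check; the paper's contribution concerns the bias this first-order discretization injects into BSDE training losses.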

[LG-36] DRL-STAF: A Deep Reinforcement Learning Framework for State-Aware Forecasting of Complex Multivariate Hidden Markov Processes

链接: https://arxiv.org/abs/2605.14632
作者: Manrui Jiang,Jingru Huang,Yong Chen,Chen Zhang
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:

点击查看摘要

Abstract:Forecasting multivariate hidden Markov processes is challenging due to nonlinear and nonstationary observations, latent state transitions, and cross-sequence dependencies. While deep learning methods achieve strong predictive accuracy, they typically lack explicit state modeling, whereas Hidden Markov Models (HMMs) provide interpretable latent states but struggle with complex nonlinear emissions and scalability. To address these limitations, we propose DRL-STAF, a Deep Reinforcement Learning based STate-Aware Forecasting framework that jointly predicts next-step observations and estimates the corresponding hidden states for complex multivariate hidden Markov processes. Specifically, DRL-STAF models complex nonlinear emissions using deep neural networks and estimates discrete hidden states using reinforcement learning, reducing the reliance on predefined transition structures and enabling flexible adaptation to diverse temporal dynamics. In particular, DRL-STAF mitigates the state-space explosion encountered by typical multivariate HMM-based methods. Extensive experiments demonstrate that DRL-STAF outperforms HMM variants, standalone deep learning models, and existing DL-HMM hybrids in most cases, while also providing reliable hidden-state estimates.

[LG-37] Silent Collapse in Recursive Learning Systems

链接: https://arxiv.org/abs/2605.14588
作者: Zhipeng Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recursive learning – where models are trained on data generated by previous versions of themselves – is increasingly common in large language models, autonomous agents, and self-supervised systems. However, standard performance metrics (loss, perplexity, accuracy) often fail to detect internal degradation before it becomes irreversible. Here we identify a phenomenon we call silent collapse: under broad recursive conditions, model internal distributions – predictive entropy, representational diversity, and tail coverage – progressively contract even as conventional metrics appear stable or improving. We discover that silent collapse is not abrupt. Its onset is reliably preceded by three trajectory-level precursors: (1) contraction of anchor entropy, (2) freezing of representation drift, and (3) erosion of tail coverage. These signals manifest multiple generations before any degradation in standard validation metrics, enabling early warning. Based on these precursors, we propose the MTR (Monitor–Trust–Regulator) framework, a lightweight metacognitive loop that monitors trajectory statistics, estimates a slow-timescale trust variable, and adaptively modulates the effective learning intensity. MTR provides early warning and actively prevents silent collapse without requiring access to pristine real data – a critical advantage when original data is unavailable, contaminated, or private.

[LG-38] Woodelf: A Fast and Unified Partial Dependence Plot Algorithm for Decision Tree Ensembles IJCAI2026

链接: https://arxiv.org/abs/2605.14578
作者: Ron Wettenstein,Alexander Nadel,Udi Boker
类目: Machine Learning (cs.LG)
*备注: Extended version of the paper to appear at IJCAI 2026

点击查看摘要

Abstract:Partial Dependence Plots (PDPs) visualize how changes in a single feature affect the average model prediction. They are widely used in practice to interpret decision tree ensembles and other machine learning models. Joint-PDPs extend this idea to pairs of features, revealing their combined effect. Partial Dependence Interaction Values (PDIVs) measure feature interactions. The Any-Order-PDIVs task computes these interactions for every feature subset across all rows of the dataset. We introduce Woodelf++, a unified and efficient approach for computing all these useful explainability tools on decision tree ensembles, building on Woodelf, an algorithm for efficient SHAP computation. By deriving suitable metrics over pseudo-Boolean functions, Woodelf++ can compute PDPs (exact and approximate), Joint-PDPs, and Any-Order-PDIVs in a unified framework. Our method delivers substantial complexity improvements over the state of the art, including an exponential gain for Any-Order-PDIVs. Additionally, we introduce and efficiently compute Full PDPs, which leverage the model’s split thresholds to faithfully capture its behavior across all possible feature values. Woodelf++ is implemented in pure Python and supports GPU acceleration. On a dataset with 400,000 rows, Woodelf++ computes PDP and Joint-PDP up to 6x faster than the state of the art and up to five orders of magnitude faster than scikit-learn. For Any-Order-PDIVs, the gap is even larger: Woodelf++ computes all interaction values in 5 minutes, while the state of the art is estimated to require over 1,000,000 years.
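The PDP itself has a simple definition: the average model prediction as one feature is swept over a grid while all other features keep their observed values. A brute-force sketch of that definition (not Woodelf++’s tree-traversal algorithm, which exploits ensemble structure; the toy model and data are illustrative):

```python
def partial_dependence(model, data, feature_idx, grid):
    """Brute-force PDP: for each grid value v, set feature_idx to v in every
    row and average the model's predictions."""
    pdp = []
    for v in grid:
        total = 0.0
        for row in data:
            x = list(row)
            x[feature_idx] = v
            total += model(x)
        pdp.append(total / len(data))
    return pdp

# Toy model with an interaction term; toy dataset (illustrative only).
model = lambda x: 2.0 * x[0] + x[0] * x[1]
data = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.0)]
curve = partial_dependence(model, data, 0, [0.0, 1.0, 2.0])
```

This is O(|grid| x |data|) per feature; the point of algorithms like Woodelf++ is to avoid exactly this per-row re-evaluation on tree ensembles.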

[LG-39] Let Robots Feel Your Touch: Visuo-Tactile Cortical Alignment for Embodied Mirror Resonance

链接: https://arxiv.org/abs/2605.14571
作者: Tianfang Zhu,Ning An,Rui Wang,Jiasi Gao,Qingming Luo,Anan Li,Guyue Zhou
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Observing touch on another’s body can elicit corresponding tactile sensations in the observer, a phenomenon termed mirror touch that supports empathy and social perception. This visuo-tactile resonance is thought to rely on structural correspondence between visual and somatosensory cortices, yet robotic systems lack computational frameworks that instantiate this principle. Here we demonstrate that cortical correspondence can be operationalized to endow robots with mirror touch. We introduce Mirror Touch Net, which imposes semantic, distributional and geometric alignment between visual and tactile representations through multi-level constraints, enabling prediction of millimetre-scale tactile signals across 1,140 taxels on a robotic hand from RGB images. Manifold analysis reveals that these constraints reshape visual representations into geometry consistent with the tactile manifold, reducing the complexity of cross-modal mapping. Extending this alignment framework to cross-domain observations of human hands enables tactile prediction and reflexive responses to observed human touch. Our results link a neural principle of visuo-tactile resonance to robotic perception, providing an explainable route towards anticipatory touch and empathic human-robot interaction. Code is available at this https URL.

[LG-40] SeesawNet: Towards Non-stationary Time Series Forecasting with Balanced Modeling of Common and Specific Dependencies IJCAI ECAI2026

链接: https://arxiv.org/abs/2605.14551
作者: Hao Li,Lu Zhang,Liu Chong,Yankai Chen,Pengyang Wang,Yingjie Zhou
类目: Machine Learning (cs.LG)
*备注: Accepted by IJCAI-ECAI 2026, the 35th International Joint Conference on Artificial Intelligence. Code is at this https URL

点击查看摘要

Abstract:Instance normalization (IN) is widely used in non-stationary multivariate time series forecasting to reduce distribution shifts and highlight common patterns across samples. However, IN can over-smooth instance-specific structural information that is essential for modeling temporal and cross-channel heterogeneity. While prior methods further suppress distribution discrepancies or attempt to recover temporal specific dependencies, they often ignore a central tension: how to adaptively model common and instance-specific dependency based on each instance’s non-stationary structures. To address this dilemma, we propose SeesawNet, a unified architecture that dynamically balances common and instance-specific dependency modeling in both temporal and channel dimensions. At its core is Adaptive Stationary-Nonstationary Attention (ASNA), which captures common dependencies from normalized sequences and specific dependencies from raw sequences, and adaptively fuses them according to instance-level non-stationarity. Built upon ASNA, SeesawNet alternates dedicated temporal and channel relationship modeling to jointly capture long-range and cross-variable dependencies. Extensive experiments on multiple real-world benchmarks demonstrate that SeesawNet consistently outperforms state-of-the-art methods.
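Instance normalization, the starting point of the abstract, normalizes each series by its own statistics. A minimal sketch showing both the operation and the over-smoothing the paper targets: after IN, two instances with very different levels and scales become nearly identical, i.e. the instance-specific information is removed before the model sees the data (the series values are illustrative).

```python
import math

def instance_norm(series, eps=1e-8):
    """Normalize one series by its own mean/std; keep the stats for de-normalization."""
    m = sum(series) / len(series)
    var = sum((x - m) ** 2 for x in series) / len(series)
    s = math.sqrt(var + eps)
    return [(x - m) / s for x in series], (m, s)

def denorm(series, stats):
    m, s = stats
    return [x * s + m for x in series]

# Two instances with very different levels and scales (illustrative).
a, stats_a = instance_norm([10.0, 12.0, 14.0])
b, stats_b = instance_norm([100.0, 120.0, 140.0])
# After IN they are (near-)identical: the instance-specific structure
# the paper argues can be over-smoothed is gone from the normalized view.
```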

[LG-41] Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework

链接: https://arxiv.org/abs/2605.14550
作者: Phuc Truong Loc Nguyen,Thanh Hung Do,Truong Thanh Hung Nguyen,Hung Cao
类目: Machine Learning (cs.LG)
*备注: Accepted to the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026)

点击查看摘要

Abstract:Artificial intelligence in high-stakes tabular domains cannot be evaluated by predictive performance alone, yet current practice still assesses explainability, fairness, robustness, privacy, and sustainability mostly in isolation. We propose the Model Integrity and Responsibility Assessment Index (MIRAI), a unified evaluation framework that measures tabular models across these five dimensions under a controlled comparison setting and aggregates them into a single score. MIRAI combines established metrics through normalized and direction-aligned dimension scores, which enables direct comparison across models with different architectural and computational profiles. Experiments on healthcare, financial, and socioeconomic datasets show that higher predictive performance does not necessarily imply better overall integrity and responsibility. In several cases, simpler models achieve a stronger cross-dimensional balance than more complex deep tabular architectures. MIRAI provides a compact and practical basis for responsible model selection in regulated settings.
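The abstract describes normalized, direction-aligned dimension scores aggregated into a single score. A hedged sketch of one plausible aggregation (min-max normalization, flipping lower-is-better metrics, then an unweighted mean); the actual MIRAI metrics and weighting are not specified in the abstract, and all values below are made up.

```python
def normalize(values, lower_is_better=False):
    """Min-max normalize a metric across models; flip so higher is always better."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if lower_is_better else scaled

# Illustrative metrics for three candidate models (all values made up).
metrics = {
    "fairness_gap": ([0.10, 0.02, 0.30], True),   # lower is better
    "robust_acc":   ([0.70, 0.65, 0.80], False),  # higher is better
    "energy_kwh":   ([5.0, 1.0, 9.0],    True),   # lower is better
}
aligned = [normalize(vals, flip) for vals, flip in metrics.values()]
scores = [sum(col) / len(col) for col in zip(*aligned)]
```

In this toy example the model with the best robust accuracy (index 2) ranks last overall, mirroring the abstract’s observation that higher predictive performance does not necessarily imply better overall integrity and responsibility.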

[LG-42] Discovering Physical Directions in Weight Space: Composing Neural PDE Experts

链接: https://arxiv.org/abs/2605.14546
作者: Pengkai Wang,Pengwei Liu,Yuanyi Wang,Guanyu Chen,Xingyu Ren,Xiaolong Li,Zhongkai Hao,Yuting Kong,Qixin Zhang,Dong Ni
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in neural operators have made partial differential equation (PDE) surrogate modeling increasingly scalable and transferable through large-scale pretraining and in-context adaptation. However, after a shared operator is fine-tuned to multiple regimes within a continuous physical family, it remains unclear whether the resulting weight-space updates merely form isolated regime experts or reveal reusable physical structure. Starting from a shared family anchor, we fine-tune low- and high-regime endpoint experts and show that their updates can be separated into a family-shared adaptation and a direction aligned with the underlying physical parameter. This separation reinterprets endpoint experts as finite-difference probes of a local physical direction in weight space, explaining why static averaging can interpolate between regimes but attenuates endpoint-specific physics. Building on this perspective, we propose Calibration-Conditioned Merge (CCM), a post-hoc coordinate readout method for composing neural PDE experts along this physical direction. Given physical metadata, a calibrated coordinate mapping, or a short observed rollout prefix, CCM infers the target composition coordinate and deploys a single merged checkpoint for the remaining rollout. We evaluate CCM on the reaction–diffusion system, viscosity-parameterized two-dimensional Navier–Stokes equations, and radial dam-break dynamics. Across these benchmarks, CCM achieves its strongest gains in extrapolative regimes, reducing out-of-distribution rollout error relative to the family anchor by 54.2%, 42.8%, and 13.8%, respectively. Further experiments across FNO scales, a DPOT-style backbone, and ablations confirm that endpoint fine-tuning is not arbitrary checkpoint drift, but reveals a calibratable physical direction for training-free transfer across PDE regimes.
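The finite-difference reading of the endpoint experts can be sketched directly in weight space: recover a per-weight direction from the two experts, then move the anchor along it to any target coordinate. This is a schematic illustration of that idea on toy weight vectors, not CCM’s actual merge rule.

```python
def compose_expert(anchor, lo, hi, p_lo, p_hi, p_target):
    """Treat endpoint experts as finite-difference probes: recover a per-weight
    physical direction and move along it to the target coordinate."""
    direction = [(h - l) / (p_hi - p_lo) for h, l in zip(hi, lo)]
    shared = [0.5 * (h + l) - a for h, l, a in zip(hi, lo, anchor)]  # family-shared shift
    mid = 0.5 * (p_hi + p_lo)
    return [a + s + (p_target - mid) * d
            for a, s, d in zip(anchor, shared, direction)]

# Toy "weights": a family anchor plus low/high regime experts (values illustrative).
anchor, lo, hi = [0.0, 0.0], [1.0, 2.0], [3.0, 6.0]
merged_lo = compose_expert(anchor, lo, hi, p_lo=1.0, p_hi=3.0, p_target=1.0)
merged_mid = compose_expert(anchor, lo, hi, p_lo=1.0, p_hi=3.0, p_target=2.0)
merged_extra = compose_expert(anchor, lo, hi, p_lo=1.0, p_hi=3.0, p_target=5.0)
```

At the endpoint coordinates the composition reproduces the experts exactly, and beyond them it extrapolates linearly along the recovered direction, which is the regime where the paper reports its strongest gains.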

[LG-43] Exploring Geographic Relative Space in Large Language Models through Activation Patching

链接: https://arxiv.org/abs/2605.14535
作者: Stef De Sabbata,Rahul Baiju,Stefano Mizzaro,Kevin Roitero
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The increased use of Large Language Models (LLMs) in geography raises substantial questions about the safety of integrating these tools across a wide range of processes and analyses, given our very limited understanding of their inner workings. In this extended abstract, we examine how LLMs process relative geographic space using activation patching, an emerging tool for mechanistic interpretability.
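Activation patching, the technique named in the abstract, swaps an intermediate activation cached from one run into another run to localize where information is causally encoded. A toy sketch on a two-stage function standing in for a network (not an LLM; the functions are illustrative):

```python
def layer1(x):
    # Stand-in for an early layer whose activation we cache and patch.
    return [x[0] + x[1], x[0] - x[1]]

def layer2(h):
    # Stand-in for the rest of the model.
    return 3.0 * h[0] + h[1]

def forward(x, patch=None):
    h = layer1(x)
    if patch is not None:
        h = patch  # overwrite this site with an activation cached from another run
    return layer2(h)

clean, corrupted = [1.0, 2.0], [0.0, 0.0]
h_clean = layer1(clean)
base = forward(corrupted)               # corrupted run: output destroyed
patched = forward(corrupted, h_clean)   # clean activation restores the output
```

If patching the cached activation into the corrupted run restores the clean output, the causal information flows through that activation site, which is the kind of evidence the paper uses to probe geographic relations.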

[LG-44] Lang2MLIP: End-to-End Language-to-Machine Learning Interatomic Potential Development with Autonomous Agentic Workflows

链接: https://arxiv.org/abs/2605.14527
作者: Wenwen Li,Yuki Orimo,Nontawat Charoenphakdee
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
*备注: 31 pages, 12 figures

点击查看摘要

Abstract:Developing machine learning interatomic potentials (MLIPs) for complex materials systems remains challenging because it requires expertise in atomistic simulations, machine learning, and workflow design, as well as iterative active learning procedures. Existing automated pipelines typically assume a fixed sequence of stages or depend on domain experts, which limits their adaptability to heterogeneous materials systems where the optimal curriculum is not known in advance. To lower the barrier to developing MLIPs for non-experts, we propose Lang2MLIP, a multi-agent framework that takes natural-language input and formulates end-to-end MLIP development as a sequential decision-making problem solved by large language models (LLMs). At each step, a decision-making agent observes the current dataset, model, evaluation results, and execution log, and then automatically selects an appropriate action to improve the model. This removes the need for a predefined pipeline and enables the agent to self-correct by revisiting earlier subsystems when new failures arise. We evaluate this approach on a solid electrolyte interphase (SEI) system with multiple components and interfaces. These results suggest that LLM-based multi-agent systems are a promising direction for automating MLIP development and making it more accessible to non-experts.

[LG-45] Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

链接: https://arxiv.org/abs/2605.14521
作者: Yuxin Guo,Yihao Yue,Yunhao Ni,Yizhou Ruan,Jie Luo,Wenjun Wu,Lei Huang
类目: Machine Learning (cs.LG)
*备注: 33 pages, 21 figures

点击查看摘要

Abstract:Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may discard benefits associated with centering. This paper proposes a framework to determine whether an LN in an arbitrary DNN can be replaced by RMSNorm without changing the model function. The key idea is to fold LN’s centering operation into upstream general linear layers by enforcing zero-mean outputs through the column-centered constraint (CCC) and column-based weight centering (CBWC). We extend the analysis to arbitrary DNNs, define such LNs as foldable LNs, and develop a graph-based detection algorithm. Our analysis shows that many LNs in widely used architectures are foldable, enabling exact inference-time conversion and end-to-end acceleration of 2% to 12% without changing model predictions. Experiments across multiple task families further show that, when exact equivalence is partially broken in practical training settings, our method remains competitive with vanilla LN while improving efficiency.
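The core claim — centering the columns of the upstream weight matrix (and the bias) makes LN’s mean subtraction a no-op, so LN coincides with RMSNorm — can be checked numerically. A pure-Python sketch in the spirit of the paper’s CBWC idea (the paper’s exact conditions and detection algorithm are more general; the weights below are arbitrary):

```python
import math

def layer_norm(y, eps=1e-12):
    m = sum(y) / len(y)
    v = sum((t - m) ** 2 for t in y) / len(y)
    return [(t - m) / math.sqrt(v + eps) for t in y]

def rms_norm(y, eps=1e-12):
    r = math.sqrt(sum(t * t for t in y) / len(y) + eps)
    return [t / r for t in y]

def center_columns(W):
    """CBWC-style centering: subtract each column's mean over the output
    dimension, so every output vector of the linear layer has zero mean."""
    d = len(W)
    means = [sum(W[i][j] for i in range(d)) / d for j in range(len(W[0]))]
    return [[W[i][j] - means[j] for j in range(len(W[0]))] for i in range(d)]

def linear(W, b, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) + b[i] for i in range(len(W))]

W = center_columns([[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]])
bias = [t - sum([0.3, -0.2, 0.5]) / 3 for t in [0.3, -0.2, 0.5]]  # centered bias
y = linear(W, bias, [0.7, -1.3])
# With zero-mean outputs, LN's centering is a no-op and LN == RMSNorm.
```

Because every output vector has zero mean by construction, LN’s variance equals RMSNorm’s mean square, and the two normalizations return the same values.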

[LG-46] A Novel Schur-Decomposition-Based Weight Projection Method for Stable State-Space Neural-Network Architectures

链接: https://arxiv.org/abs/2605.14489
作者: Sergio Vanegas,Lasse Lensu,Fredy Ruiz
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 32 pages, 13 figures. Source code at this https URL

点击查看摘要

Abstract:Building black-box models for dynamical systems from data is a challenging problem in machine learning, especially when asymptotic stability guarantees are required. In this paper, we introduce a novel stability-ensuring and backpropagation-compatible projection scheme based on the Schur decomposition for the state matrix of linear discrete-time state-space layers, as well as an alternative pre-factorized formulation of the methodology. The proposed methods dynamically project the quasi-triangular factor of the state matrix’s real Schur decomposition onto its nearest stable peer, ensuring stable dynamics with minimal overparameterization. Experiments on synthetic linear systems demonstrate that the method achieves accuracy and convergence rates comparable to those of state-of-the-art stable-system identification techniques, despite a marginal increase in computational complexity. Furthermore, the lower weight count facilitates convergence during training without sacrificing accuracy in stacked neural-network architectures with static nonlinearities targeting real-world datasets. These results suggest that the Schur-based projection provides a numerically robust framework for identifying complex dynamics on par with the State of the Art while satisfying strict asymptotic-stability requirements.
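The paper projects the quasi-triangular factor of the real Schur decomposition; as a much simpler stand-in, the sketch below enforces discrete-time stability for a 2x2 state matrix by radially scaling its spectral radius (eigenvalues via the characteristic polynomial). This only illustrates the general "project onto a nearby stable matrix" idea, not the paper’s Schur-based method.

```python
import math

def spectral_radius_2x2(A):
    """Eigenvalue moduli of a 2x2 real matrix via the characteristic polynomial."""
    tr = A[0][0] + A[1][1]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    disc = tr * tr - 4.0 * det
    if disc >= 0:
        r = math.sqrt(disc)
        return max(abs((tr + r) / 2.0), abs((tr - r) / 2.0))
    # Complex-conjugate pair: the common modulus is sqrt(det) (det > 0 here).
    return math.sqrt(det)

def project_stable(A, margin=0.99):
    """Radially scale A so its spectral radius is at most `margin` (< 1).
    A crude stand-in for the paper's Schur-based nearest-stable projection."""
    rho = spectral_radius_2x2(A)
    if rho <= margin:
        return A
    s = margin / rho
    return [[s * a for a in row] for row in A]

A_unstable = [[1.2, 0.5], [0.0, 0.9]]
A_stable = project_stable(A_unstable)
```

Radial scaling shrinks all modes uniformly, which is exactly the kind of distortion a nearest-stable Schur projection avoids by modifying only the unstable part of the quasi-triangular factor.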

[LG-47] Test-Time Learning with an Evolving Library

链接: https://arxiv.org/abs/2605.14477
作者: Weijia Xu,Alessandro Sordoni,Chandan Singh,Zelalem Gero,Michel Galley,Xingdi Yuan,Jianfeng Gao
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce EvoLib, a test-time learning framework that enables large language models to accumulate, reuse, and evolve knowledge across problem instances without parameter updates or external supervision. Instead of adapting model parameters, our approach maintains a shared library of knowledge abstractions, including modular skills and reflective insights, automatically extracted from the model’s own inference trajectories. To support continual improvement, we introduce a principled weighting and consolidation mechanism that jointly optimizes for immediate utility and long-term value. This allows simple, instance-specific abstractions to evolve into more general and reusable ones over time. Across challenging benchmarks in mathematical reasoning, code generation, and multi-turn agentic environments, EvoLib improves substantially over the top test-time scaling and learning methods without ground-truth feedback.

[LG-48] Focused PU learning from imbalanced data

链接: https://arxiv.org/abs/2605.14467
作者: Elias Zavitsanos,Georgios Paliouras
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a new method of learning from positive and unlabeled (PU) examples in highly imbalanced datasets. Many real-world problems, such as disease gene identification, targeted marketing, fraud detection, and recommender systems, are hard to address with machine learning methods, due to limited labeled data. Often, training data comprises positive and unlabeled instances, the latter typically dominated by negative instances but also including some positives. While PU learning is well-studied, few methods address imbalanced settings or hard-to-detect positive examples that resemble negative ones. Our approach uses a focused empirical risk estimator, incorporating both positive and unlabeled examples to train binary classifiers. Empirical evaluations demonstrate state-of-the-art performance on imbalanced datasets under two labeling mechanisms - selecting positives completely at random (SCAR) and selecting at random (SAR). Beyond these controlled experiments, we demonstrate the value of the proposed method in the real-world application of financial misstatement detection.
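As background for the "focused empirical risk estimator" mentioned above, here is a sketch of the standard non-negative PU (nnPU-style) empirical risk that PU estimators typically build on, with a sigmoid surrogate loss and a known class prior pi. The paper’s focused estimator differs; all inputs here are illustrative.

```python
import math

def sigmoid_loss(score, label):
    # l(s, y) = sigmoid(-y * s): a smooth surrogate for the 0-1 loss.
    return 1.0 / (1.0 + math.exp(label * score))

def nnpu_risk(scores_pos, scores_unl, pi):
    """Non-negative PU risk: pi * R_p^+ + max(0, R_u^- - pi * R_p^-),
    where unlabeled data stands in for the (unseen) negative class."""
    r_p_pos = sum(sigmoid_loss(s, +1) for s in scores_pos) / len(scores_pos)
    r_p_neg = sum(sigmoid_loss(s, -1) for s in scores_pos) / len(scores_pos)
    r_u_neg = sum(sigmoid_loss(s, -1) for s in scores_unl) / len(scores_unl)
    return pi * r_p_pos + max(0.0, r_u_neg - pi * r_p_neg)

# Illustrative classifier scores; pi is the (assumed known) positive prior.
risk = nnpu_risk(scores_pos=[2.0, 1.5, 0.5], scores_unl=[-1.0, -2.0, 0.3, -0.5], pi=0.2)
```

The max(0, ...) clamp prevents the negative-class risk estimate from going below zero, which is what keeps such estimators usable in the imbalanced regimes the paper targets.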

[LG-49] FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

链接: https://arxiv.org/abs/2605.14445
作者: Runyuan He,Qiuyang Mang,Shang Zhou,Kaiyuan Liu,Hanchen Li,Huanzhi Mao,Qizheng Zhang,Zerui Li,Bo Peng,Lufeng Cheng,Tianfu Fu,Yichuan Wang,Wenhao Chai,Jingbo Shang,Alex Dimakis,Joseph E. Gonzalez,Alvin Cheung
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Many real-world coding challenges are open-ended and admit no known optimal solution. Yet, recent progress in LLM coding has focused on well-defined tasks such as feature implementation, bug fixing, and competitive programming. Open-ended coding remains a weak spot for LLMs, largely because open-ended training problems are scarce and expensive to construct. Our goal is to synthesize open-ended coding problems at scale to train stronger LLM coders. We introduce FrontierSmith, an automated system for iteratively evolving open-ended problems from existing closed-ended coding tasks. Starting from competitive programming problems, FrontierSmith generates candidate open-ended variants by changing the problems’ goals, restricting outputs, and generalizing inputs. It then uses a quantitative idea divergence metric to select problems that elicit genuinely diverse approaches from different solvers. Agents then generate test cases and verifiers for the surviving candidates. On two open-ended coding benchmarks, training on our synthesized data yields substantial gains over the base models: Qwen3.5-9B improves by +8.82 score on FrontierCS and +306.36 (Elo-rating-based performance) on ALE-bench; Qwen3.5-27B improves by +12.12 and +309.12, respectively. The synthesized problems also make agents take more turns and use more tokens, similar to human-curated ones, suggesting that closed-ended seeds can be a practical starting point for long-horizon coding data.

[LG-50] What if Tomorrow is the World Cup Final? Counterfactual Time Series Forecasting with Textual Conditions

链接: https://arxiv.org/abs/2605.14422
作者: Shuqi Gu,Yongxiang Zhao,Baoyu Jing,Kan Ren
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Time series forecasting has become increasingly critical in real-world scenarios, where future sequences are influenced not only by historical patterns but also by forthcoming events. In this context, forecasting must dynamically adapt to complex and stochastic future conditions, which introduces fundamental challenges in both forecasting and evaluation. Traditional methods typically rely on historical data or factual future conditions, while overlooking counterfactual scenarios. Furthermore, many existing approaches are restricted to simple structured conditions, limiting their ability to generalize to the real-world complexities. To address these gaps, we introduce the task of counterfactual time series forecasting with textual conditions, enabling more flexible and condition-aware forecasting. We propose a comprehensive evaluation framework that encompasses both factual and counterfactual settings, even in the absence of ground truth time series. Additionally, we present a novel text-attribution mechanism that distinguishes mutable from immutable factors, thereby improving forecast accuracy under sophisticated and stochastic textual conditions. The project page is at this https URL

[LG-51] Watch your neighbors: Training statistically accurate chaotic systems with local phase space information

链接: https://arxiv.org/abs/2605.14405
作者: Joon-Hyuk Ko,Andrus Giraldo,Deok-Sun Lee
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Chaotic systems pose fundamental challenges for data-driven dynamics discovery, as small modeling errors lead to exponentially growing trajectory discrepancies. Since exact long-term prediction is unattainable, it is natural to ask what a good surrogate model for chaotic dynamics is. Prior work has largely focused either on reproducing the Jacobian of the underlying dynamics, which governs local expansion and contraction rates, or on training surrogate models that reproduce the ground-truth dynamics’ long-term statistical behavior. In this work, we propose a new framework that aims to bridge these two paradigms by training surrogate dynamics models with accurate Jacobians and long-term statistical properties. Our method constructs a local covering of a chaotic attractor in phase space and analyzes the expansion and contraction of these coverings under the dynamics. The surrogate model is trained by minimizing the maximum mean discrepancy between the pushforward distributions of the coverings under the surrogate and ground-truth dynamics. Experiments show that our method significantly improves Jacobian accuracy while remaining competitive with state-of-the-art statistically accurate dynamics learning methods. Our code is fully available at this https URL.
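The training objective above minimizes a maximum mean discrepancy (MMD) between pushforward distributions. A sketch of the (biased) RBF-kernel squared-MMD estimate between two 1-D point clouds; the bandwidth and data are illustrative, and real uses would work with the covering pushforwards described in the abstract.

```python
import math

def rbf(x, y, gamma):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(X, Y, gamma=1.0):
    """Biased squared-MMD estimate with an RBF kernel:
    mean k(X,X) - 2 mean k(X,Y) + mean k(Y,Y)."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx - 2.0 * kxy + kyy

same = mmd2([0.0, 0.1, -0.1], [0.0, 0.1, -0.1])
shifted = mmd2([0.0, 0.1, -0.1], [3.0, 3.1, 2.9])
```

Identical samples give zero, while shifted samples give a large positive value, which is the gradient signal used to pull the surrogate’s pushforwards toward the ground truth’s.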

[LG-52] MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

链接: https://arxiv.org/abs/2605.14364
作者: Jiaqi Sun,Boyang Sun,Mohamad Rasmy,Xiangchen Song,Kun Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting existing ones, and this structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation.

[LG-53] Randomized Atomic Feature Models for Physics-Informed Identification of Dynamic Systems

链接: https://arxiv.org/abs/2605.14351
作者: Rajiv Singh,Mario Sznaier,Lennart Ljung
类目: Systems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Extended version of the conference paper submitted for IFAC World Congress, 2026

点击查看摘要

Abstract:We present a physics-informed framework for system identification based on randomized stable atomic features. Impulse responses are represented as random superpositions of stable atoms, namely damped complex exponentials associated with poles sampled inside a prescribed disk. Identification is then cast as a convex regularized least-squares problem with optional linear, second-order-cone, and KYP constraints. The approach generalizes random Fourier and random Laplace features to the damped, nonstationary regime relevant to engineering systems while retaining modal interpretability and scalable finite-dimensional computation. The main analytic point is an operator-theoretic Disk-Bochner viewpoint: positive measures over stable poles generate positive-definite kernels with a radius-dependent shift defect, while a converse scalar disk moment representation for an arbitrary kernel is characterized by subnormality of the canonical shift. We prove this statement, establish an RKHS-to-l1 embedding, show that sampled poles induce a valid finite atomic gauge, discuss random-feature convergence, and state sparse-recovery guarantees conditionally on the restricted-eigenvalue properties of the realized disk-Vandermonde or input-output design matrix. We also connect the normalized transfer function problem to Nevanlinna-Pick interpolation and LFT set-membership. The framework directly encodes stability margins, modal localization, DC-gain bounds, monotonicity, passivity, relative degree, settling-time targets, and time/frequency-domain error bounds. Numerical comparisons illustrate how physically meaningful priors can compensate for poor excitation and improve constrained impulse-response recovery in an under-informative data setting.

[LG-54] Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

链接: https://arxiv.org/abs/2605.14350
作者: Nicholas E. Corrado,Wenyuan Huang,Josiah P. Hanna
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multi-task reinforcement learning (MTRL) aims to train a single agent to efficiently optimize performance across multiple tasks simultaneously. However, jointly optimizing all tasks often yields imbalanced learning: agents quickly solve easy tasks but learn slowly on harder ones. While prior work primarily attributes this imbalance to conflicting task gradients and proposes gradient manipulation or specialized architectures to address it, we instead focus on a distinct and under-explored challenge: imbalanced data allocation. Standard MTRL allocates an equal number of environment interactions to each task, which over-allocates data to easy tasks that require relatively few interactions to solve and under-allocates data to hard tasks that require substantially more experience to solve. To address this challenge, we introduce Distributionally Robust Adaptive Task Sampling (DRATS), an algorithm that adaptively prioritizes sampling tasks furthest from being solved. We derive DRATS by formalizing MTRL as a feasibility problem from which we derive a minimax objective for minimizing the worst-case return gap, the difference between a desired target return and the agent’s return on a task. In benchmarks like MetaWorld-MT10 and MT50, DRATS improves data efficiency and increases worst-task performance compared to existing task sampling algorithms.
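The sampling idea — prioritize the tasks furthest from being solved — can be sketched as a softmax over per-task return gaps (target return minus current return). The actual DRATS objective is a minimax over worst-case gaps, so this is only a schematic instantiation with made-up numbers.

```python
import math

def task_sampling_probs(returns, targets, temperature=1.0):
    """Sample tasks in proportion to exp(gap / T), where gap = target - return:
    tasks furthest from their target are drawn most often."""
    gaps = [t - r for r, t in zip(returns, targets)]
    weights = [math.exp(g / temperature) for g in gaps]
    z = sum(weights)
    return [w / z for w in weights]

# Three tasks: the first is nearly solved, the last is far from its target.
probs = task_sampling_probs(returns=[0.95, 0.60, 0.10], targets=[1.0, 1.0, 1.0])
```

Lowering the temperature pushes this toward the minimax behavior of sampling only the worst task, while a high temperature recovers uniform allocation, i.e. the standard MTRL setup the paper improves on.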

[LG-55] Exemplar Partitioning for Mechanistic Interpretability

链接: https://arxiv.org/abs/2605.14347
作者: Jessica Rumbelow
类目: Machine Learning (cs.LG)
*备注: Code: this https URL . Pretrained dictionaries: this https URL

点击查看摘要

Abstract:We introduce Exemplar Partitioning (EP), an unsupervised method for constructing interpretable feature dictionaries from large language model activations with \sim 10^3\times fewer tokens than comparable sparse autoencoders (SAEs). An EP dictionary is a Voronoi partition of activation space, built by leader-clustering streamed activations within a distance threshold. Each region is anchored by an observed exemplar that serves as both its membership criterion and intervention direction; dictionary size is not prespecified, but determined by the activation geometry at that threshold. Because exemplars are observed rather than learned, dictionaries built from the same data stream are directly comparable across layers, models, and training checkpoints. We characterise EP as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those introduced by it. EP regions and Gemma Scope SAE features decompose activation space differently but agree on a shared core: \sim 20% of EP regions match an SAE feature at F_1 > 0.5, and EP one-hot probes retain \sim 97% of raw-activation probe accuracy at \ell_0 = 1. Nearest-exemplar distance provides a free out-of-distribution signal at inference. On AxBench latent concept detection at Gemma-2-2B-it L20, EP at p_1 reaches mean AUROC 0.881, +0.126 over the canonical GemmaScope SAE leaderboard entry and within 0.030 of SAE-A’s 0.911, at \sim 10^3\times less build compute.

[LG-56] Nearest-Neighbor Radii under Dependent Sampling

链接: https://arxiv.org/abs/2605.14343
作者: Yuanyuan Gao,Yilong Hou,Zhexiao Lin
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 33 pages

点击查看摘要

Abstract:Nearest-neighbor methods are fundamental to classical and modern machine learning, yet their geometric properties are typically analyzed under independent sampling. In this paper, we study the nearest-neighbor radii under dependent sampling. We consider dependent observations satisfying strong mixing conditions and ask whether dependence changes the scale of nearest-neighbor neighborhoods. We establish distribution-free almost sure convergence under polynomial mixing and sharp non-asymptotic moment bounds under geometric mixing. The moment bounds depend on the local intrinsic dimension rather than the ambient dimension, making the results applicable to high-dimensional data concentrated near lower-dimensional manifolds. Synthetic experiments and real-world time-series benchmarks support the theory, showing that nearest-neighbor geometry remains informative under dependent sampling.
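The quantity analyzed above — the k-nearest-neighbor radius, i.e. the distance from a point to its k-th nearest neighbor — has a direct brute-force definition. A sketch in the plane with illustrative points (requires Python 3.8+ for `math.dist`):

```python
import math

def knn_radius(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (k >= 1)."""
    dists = sorted(
        math.dist(points[i], p) for j, p in enumerate(points) if j != i
    )
    return dists[k - 1]

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
r1 = knn_radius(pts, 0, 1)   # nearest-neighbor radius of the origin
r2 = knn_radius(pts, 0, 2)
```

The paper’s results concern how these radii scale with the sample size when the points are dependent (mixing) rather than i.i.d.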

[LG-57] Guided Diffusion Sampling for Precipitation Forecast Interventions

链接: https://arxiv.org/abs/2605.14317
作者: Ayumu Ueyama,Kazuhiko Kawamoto,Hiroshi Kera
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注: 12+7 pages, 7+2 figures

点击查看摘要

Abstract:Extreme precipitation causes severe societal and economic damage, and weather control has long been discussed as a potential mitigation strategy. However, to the best of our knowledge, perturbation-based interventions for weather control using data-driven weather forecasting models have not yet been explored. While adversarial attacks also generate perturbations that alter forecasts, they aim to exploit model artifacts and do not account for physical plausibility. In this paper, we propose a gradient-based guidance framework for precipitation-reduction interventions through diffusion sampling in diffusion-based weather forecasting models. Instead of directly perturbing atmospheric states, our method steers the diffusion sampling trajectory, enabling precipitation reduction while maintaining consistency with the atmospheric distribution. To assess physical plausibility, we evaluate from three perspectives: (i) vertical and variable-wise perturbation profiles, (ii) latent-space trajectory deviation, and (iii) cross-model transferability. Experiments on extreme precipitation events from WeatherBench2 demonstrate that our method achieves effective precipitation reduction while yielding more physically plausible interventions than adversarial perturbations.

[LG-58] Language-Induced Priors for Domain Adaptation

链接: https://arxiv.org/abs/2605.14301
作者: Qiyuan Chen,Jiayu Zhou,Raed Al Kontar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Domain adaptation faces a fundamental paradox in the cold-start regime. When target data is scarce, statistical methods fail to distinguish relevant source domains from irrelevant ones, which often leads to negative transfer. In this paper, we address this challenge by leveraging expert textual descriptions of the target domain, a resource that is often available but overlooked. We propose a probabilistic framework that translates these semantic descriptions into a choice model, namely a Language-Induced Prior (LIP), that learns the preferences from a pretrained Large Language Model (LLM). The LIP is then integrated into an Expectation-Maximization algorithm to identify source relevance. Methodologically, this framework is compatible with any parametric model where a likelihood is available. It allows the LIP to guide the selection of sources when target signals are weak, while gradually refining these choices as samples accumulate. Theoretically, we prove that the estimator roughly matches an oracle cold-start MSE under a correct prior, while remaining asymptotically consistent regardless of the quality of the LIP. Empirically, we validated the framework on a descriptive (Gaussian estimation), a predictive (C-MAPSS dataset), and a prescriptive task (MuJoCo hopper).

[LG-59] Smooth Multi-Policy Causal Effect Estimation in Longitudinal Settings

链接: https://arxiv.org/abs/2605.14284
作者: Wenxin Chen,Weishen Pan,Kyra Gan,Fei Wang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Comparative evaluation of multiple dynamic treatment policies is essential for healthcare and policy decisions, yet conventional longitudinal causal inference methods estimate each in isolation, preventing information sharing across counterfactuals. We demonstrate that this separate estimation paradigm induces a structurally uncontrolled second-order bias, inflating finite-sample variance even after standard debiasing with longitudinal targeted maximum likelihood estimation (LTMLE). To address this, we propose a policy-aware reparameterization of Iterative Conditional Expectation (ICE) Q-functions that enables joint estimation through shared representations. We implement this approach in the Policy-Encoded Q Network (PEQ-Net), an architecture centered on a shared policy encoder. The encoder is trained using kernel mean embeddings, ensuring that the learned representation space reflects population-level policy dissimilarities. After applying an LTMLE correction step, we prove this design imposes a structural constraint on the second-order remainder, thereby stabilizing finite-sample variance. Experiments on semi-synthetic datasets demonstrate that PEQ-Net consistently outperforms existing ICE-based methods, achieving substantial reductions in root-mean-square error, particularly when evaluating closely related policies.

[LG-60] TILT: Target-induced loss tilting under covariate shift NEURIPS2026

链接: https://arxiv.org/abs/2605.14280
作者: Kakei Yamamoto,Martin J. Wainwright
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 32 pages, 17 figures. Submitted to NeurIPS 2026

点击查看摘要

Abstract:We introduce and analyze Target-Induced Loss Tilting (TILT) for unsupervised domain adaptation under covariate shift. It is based on a novel objective function that decomposes the source predictor as f+b , fits f+b on labeled source data while simultaneously penalizing the auxiliary component b on unlabeled target inputs. The resulting fit f is deployed as the final target predictor. At the population level, we show that this target-side penalty implicitly induces relative importance weighting at the population level, but in terms of an estimand b^*_f that is self-localized to the current error, and remains uniformly bounded for any source-target pair (even those with disjoint supports). We prove a general finite-sample oracle inequality on the excess risk, and use it to give an end-to-end guarantee for training with sparse ReLU networks. Experiments on controlled regression problems and shifted CIFAR-100 distillation show that TILT improves target-domain performance over source-only training, exact importance weighting, and relative density-ratio baselines, with a stable dependence on the regularization parameter.
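
The objective the abstract describes (fit the decomposed predictor f + b on labeled source data while penalizing b on unlabeled target inputs) can be sketched as a population loss. The squared-error fit, the squared penalty, and the scalar weight `lam` are illustrative assumptions here, not the paper's exact form:

```python
def tilt_loss(f, b, xs, ys, xt, lam):
    """Sketch of the TILT objective: fit the decomposed predictor f + b
    on labeled source pairs (xs, ys) while penalizing the auxiliary
    component b on unlabeled target inputs xt."""
    source_fit = sum((f(x) + b(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    target_pen = sum(b(x) ** 2 for x in xt) / len(xt)
    return source_fit + lam * target_pen

# When f + b fits the source exactly, only the target penalty remains,
# so shrinking b on target inputs steers f toward the target predictor.
loss = tilt_loss(lambda x: 2 * x, lambda x: 0.5, [1.0, 2.0], [2.5, 4.5], [10.0], lam=1.0)
print(loss)  # → 0.25
```

The component f, not f + b, is what gets deployed on the target; the penalty on b is what the abstract shows implicitly induces relative importance weighting.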

[LG-61] EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

链接: https://arxiv.org/abs/2605.14249
作者: Zhiye Song,Kyungmi Lee,Eun Kyung Lee,Xin Zhang,Tamar Eilam,Anantha P. Chandrakasan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across SM allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x energy variation across configurations in prefill and decode efficiency and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.

[LG-62] Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

链接: https://arxiv.org/abs/2605.14241
作者: Kexin Chu,Dawei Xiang,Wei Zhang
类目: Machine Learning (cs.LG)
*备注: 12 pages, 1 figure, 14 tables

点击查看摘要

Abstract:Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91–+3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.
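
The core latency-quality-matching idea, ranking providers by expected answer quality per service cycle rather than by an additive latency-discounted reward, can be sketched as follows. Reading "per service cycle" as a quality/latency ratio is our simplification; the actual router combines a capacity-aware score with contextual bandit estimation and LLM-as-judge feedback:

```python
def route(providers):
    """Latency-quality matching sketch: rank same-function providers by
    expected answer quality per unit of service time, so low latency
    cannot additively offset poor answers. Each provider is a tuple
    (name, expected_quality, latency_seconds)."""
    return max(providers, key=lambda p: p[1] / p[2])

pool = [("fast_weak", 0.40, 0.5), ("slow_strong", 0.90, 1.0), ("balanced", 0.70, 0.8)]
print(route(pool)[0])  # → slow_strong
```

Under this ranking the fast but weak provider (0.40/0.5 = 0.80 quality per second) loses to the slower, stronger one (0.90/1.0 = 0.90), which is the intended behavior the additive-reward baselines fail to guarantee.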

[LG-63] Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods NAACL2025

链接: https://arxiv.org/abs/2605.14240
作者: Andrii Shportko,Inessa Verbitsky
类目: Machine Learning (cs.LG)
*备注: NAACL 2025

点击查看摘要

Abstract:The recent large-scale emergence of LLMs has left an open space for dealing with their consequences, such as plagiarism or the spread of false information on the Internet. Coupling this with the rise of AI detector bypassing tools, reliable machine-generated text detection is in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.

[LG-64] How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

链接: https://arxiv.org/abs/2605.14200
作者: Leena Chennuru Vankadara,Moritz Haas,Luke Hayward,Sebastian Bordt,Alessandro Breccia
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width N , expert width N_e , number of experts M , sparsity K , and depth L to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling N\asymp N_e , (II) co-scaling N\asymp M\asymp K , and (III) full proportional scaling of N, N_e, M , and K . For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ( \mu ) desiderata. We then show that the resulting \mu P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the \mu P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

[LG-65] Stochastic Matching via Local Sparsification

链接: https://arxiv.org/abs/2605.14195
作者: Sara Ahmadian,Edith Cohen,Mohammad Roghani
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The classic online stochastic matching problem typically requires immediate and irrevocable matching decisions. However, in many modern decentralized systems such as real-time ride-hailing and distributed cloud computing, the primary bottleneck is often local communication bandwidth rather than the timing of the match itself. We formalize this challenge by introducing a two-stage local sparsification framework. In this setting, arriving requests must prune their realized compatibility sets to a strict budget of k edges before a central coordinator optimizes the global matching. This creates a “middle ground” between local information constraints and global optimization utility. We propose a local selection strategy, parametrized by a fractional solution of the expected instance. Theoretically, we quantify the approximation ratio as a function of the solution’s spread. We prove that under sufficient spread, our sparsifier globally preserves the expected size of the maximum matching. Empirically, we demonstrate the robustness of our approach using the New York City ride-hailing datasets and adversarial synthetic benchmarks. Our results show that near-optimal global matching is achievable even with highly constrained local budgets, significantly outperforming standard online baselines.
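
The local step, a request pruning its realized compatibility set to a budget of k edges guided by a fractional solution of the expected instance, can be sketched as weighted sampling without replacement. Proportional sampling is one concrete, hypothetical instance of a fractional-solution-guided selection; the paper's actual strategy and its spread-based guarantee are more refined:

```python
import random

def sparsify(edges, frac, k, seed=0):
    """Local sparsification sketch: an arriving request keeps at most k of
    its realized compatible edges, sampled without replacement with
    probability proportional to a fractional solution `frac` of the
    expected instance."""
    rng = random.Random(seed)
    if len(edges) <= k:
        return list(edges)
    pool, kept = list(edges), []
    for _ in range(k):
        idx = rng.choices(range(len(pool)), weights=[frac[e] for e in pool])[0]
        kept.append(pool.pop(idx))
    return kept

# A request with four compatible servers keeps only two, biased toward
# edges the fractional solution values most.
kept = sparsify(["u1", "u2", "u3", "u4"], {"u1": 0.9, "u2": 0.6, "u3": 0.3, "u4": 0.2}, k=2)
print(kept)
```

The central coordinator then solves a maximum matching over only the kept edges, which is what bounds per-request communication to k.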

[LG-66] LLMs Know When They Know but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

链接: https://arxiv.org/abs/2605.14186
作者: Qi Cao,Yufan Wang,Peijia Qin,Shuhao Zhang,Pengtao Xie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) often expose useful signals of self-monitoring: before solving a problem, they can estimate whether they are likely to succeed, and after solving it, they can judge whether their answer is likely to be correct. However, these signals are typically measured or elicited in isolation, rather than used to control inference. In this work, we ask whether LLMs possess latent metacognitive ability that can be turned into effective test-time control. Inspired by the Nelson–Narens theory from cognitive psychology, we propose a metacognitive harness that separates monitoring from reasoning. For each problem, the model first reports a pre-solve feeling-of-knowing (FOK) signal; after each solve attempt, it reports a post-solve judgment-of-learning (JOL) signal. Rather than treating these signals as passive confidence estimates, the harness turns them into an explicit control interface for reasoning: it decides when to trust the current solution, when to retry with compact metacognitive feedback, and when to pass multiple attempts to a final aggregator. Across text, code, and multimodal reasoning benchmarks, our harness substantially improves a fixed Claude Sonnet-4.6 base model without parameter updates or benchmark-specific fine-tuning. On the evaluated public benchmark snapshots, it raises pooled accuracy from 48.3 to 56.9 and exceeds the strongest listed leaderboard entries on the three primary evaluation settings: HLE-Verified, LiveCodeBench v6, and R-Bench-V. These results suggest that strong LLMs may already possess useful metacognitive ability, but require an explicit control harness to act on it during reasoning.
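
The monitoring/reasoning separation described above amounts to a small control loop. This is a hedged sketch of that loop only: all callables, the trust threshold, and the retry budget are illustrative stand-ins, not the paper's interface:

```python
def harness(solve, fok, jol, aggregate, problem, max_attempts=3, trust=0.8):
    """Metacognitive control-loop sketch: a pre-solve feeling-of-knowing
    (FOK) signal is elicited first, then each attempt is scored by a
    post-solve judgment-of-learning (JOL). A high JOL means trust the
    current solution and stop; otherwise retry, and an exhausted budget
    falls back to aggregating all attempts."""
    attempts = []
    _ = fok(problem)  # pre-solve signal; could also gate the attempt budget
    for _ in range(max_attempts):
        answer = solve(problem)
        attempts.append(answer)
        if jol(problem, answer) >= trust:
            return answer          # trust the current solution
    return aggregate(attempts)     # pass all attempts to a final aggregator
```

The point of the abstract is that `fok` and `jol` come from the same LLM being harnessed; the loop just turns those passive confidence reports into control decisions.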

[LG-67] CSI-JEPA: Towards Foundation Representations for Ubiquitous Sensing with Minimal Supervision

链接: https://arxiv.org/abs/2605.14171
作者: Xuanhao Luo,Zhizhen Li,Yuchen Liu
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:

点击查看摘要

Abstract:Channel state information (CSI) provides a widely available sensing modality for human and environment perception, but existing CSI sensing models usually rely on task-specific supervised training and require substantial labeled data for each task, device, user, or environment. This limits their scalability in practical deployments where unlabeled CSI is abundant but labeled data is costly to collect. In this paper, we present CSI-JEPA, a self-supervised predictive representation learning framework for label-efficient, multi-task Wi-Fi sensing. CSI-JEPA learns reusable temporal-spectral representations from unlabeled CSI samples by predicting latent features of masked channel regions from visible context. To better match the physical structure of CSI, CSI-JEPA tokenizes channel-response amplitude windows along the time and subcarrier dimensions. It then introduces a channel variation-aware masking strategy that samples predictive targets from regions with stronger local temporal and subcarrier-domain variations. After pretraining, the encoder is frozen and used as a backbone, with lightweight task-specific adapters added for downstream sensing tasks. We evaluate CSI-JEPA on seven real-world Wi-Fi sensing tasks spanning diverse objectives and deployment settings. The results show that CSI-JEPA improves downstream sensing performance over competitive baselines, achieving up to 10.64 percentage points mean accuracy gain over state-of-the-art supervised Transformer and matched-budget label savings of up to 98.0%.

[LG-68] Finite Sample Bounds for Learning with Score Matching

链接: https://arxiv.org/abs/2605.14168
作者: Devin Smedira,Abhijith Jayakumar,Sidhant Misra,Marc Vuffray,Andrey Y. Lokhov
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注: 22 pages

点击查看摘要

Abstract:Learning of continuous exponential family distributions with unbounded support remains an important area of research for both theory and applications in high-dimensional statistics. In recent years, score matching has become a widely used method for learning exponential families with continuous variables due to its computational ease when compared against maximum likelihood estimation. However, theoretical understanding of the statistical properties of score matching is still lacking. In this work, we provide a non-asymptotic sample complexity analysis for learning the structure of exponential families of polynomials with score matching. The derived sample bounds show a polynomial dependence on the model dimension. These bounds are the first of their kind, as all prior work has shown only asymptotic bounds on the sample complexity.

[LG-69] Uncovering Trajectory and Topological Signatures in Multimodal Pediatric Sleep Embeddings ML4H2025

链接: https://arxiv.org/abs/2605.14156
作者: Scott Ye,Harlin Lee
类目: Machine Learning (cs.LG)
*备注: Accepted to ML4H 2025, 20 pages, 6 figures

点击查看摘要

Abstract:While generative models have shown promise in pediatric sleep analysis, the latent structure of their multimodal embeddings remains poorly understood. This work investigates session-wide diagnostic information contained in the sequences of 30-second pediatric PSG epochs embedded by a multimodal masked autoencoder. We test whether augmenting embeddings with PHATE-derived per-epoch coordinates and whole-night movement descriptors, persistent homology summaries of the embedding cloud, and EHR yields task-relevant signals. Simple linear and MLP models, chosen for interpretability rather than state-of-the-art performance, show that geometric, topological, and clinical features each provide complementary gains. For binary predictions, feature importance is task-dependent, and more expressive late-fusion models generally perform better, with AUPRC improving from 0.26 to 0.34 for desaturation, 0.31 to 0.48 for EEG arousal, 0.09 to 0.22 for hypopnea, and 0.05 to 0.14 for apnea. We also report Brier score and Expected Calibration Error, where the full fusion model yields the best calibration across all four binary tasks. Our study reveals that latent geometry/topology and EHR offer complementary, interpretable signals beyond embeddings, improving calibration and robustness under extreme imbalance.

[LG-70] A Systematic Evaluation of Imbalance Handling Methods in Biomedical Binary Classification

链接: https://arxiv.org/abs/2605.14147
作者: Jiandong Chen,Lingjie Su,Le Peng,Yash Travadi,Rui Zhang,Ju Sun
类目: Machine Learning (cs.LG)
*备注: 18 pages, 1 figures, 4 tables

点击查看摘要

Abstract:Objective: The primary goal of this study was to systematically examine the impact of commonly used imbalance handling methods (IHMs) on predictive performance in biomedical binary classification, considering the interplay between model complexity and diverse data modalities. Material and Methods: We evaluated five representative IHMs: random undersampling (RUS), random oversampling (ROS), SMOTE, re-weighting (RW), and direct F1-score optimization (DMO), against a raw training (RAW) baseline. The evaluation encompassed three public biomedical datasets: MIMIC-III (tabular), ADE-Corpus-V2 (text), and MURA (image), spanning three common biomedical data modalities. To assess varying model complexity, we employed a range of architectures, from classical logistic regression and random forest to deep neural networks, including multilayer perceptron (MLP), BiLSTM, BERT, DenseNet, and DINOv2. Results: For simpler models such as logistic regression on tabular data, IHMs yielded no significant advantage over the RAW baseline, aligning with prior findings. However, clear benefits were observed for more complex models and unstructured data: (a) ROS and RW consistently enhanced the performance of powerful models; (b) direct F1-score optimization demonstrated utility primarily for unstructured text and image data; and (c) RUS and SMOTE consistently degraded performance and are therefore not recommended. Conclusion: The effectiveness of IHMs depends on both model complexity and data modality. Performance gains are most pronounced when leveraging appropriate IHMs, such as ROS, RW, and DMO, on high-complexity models.
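
The two IHMs the study recommends, ROS and RW, are both simple to state; a minimal sketch for binary labels (assuming both classes are present; these helpers are illustrative, not the study's pipeline):

```python
import random

def random_oversample(X, y, seed=0):
    """Random oversampling (ROS): duplicate minority-class samples until
    both classes are equally represented (binary labels 0/1)."""
    rng = random.Random(seed)
    pos = [(x, t) for x, t in zip(X, y) if t == 1]
    neg = [(x, t) for x, t in zip(X, y) if t == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    data = pos + neg + extra
    return [x for x, _ in data], [t for _, t in data]

def class_weights(y):
    """Re-weighting (RW): inverse-frequency class weights that leave the
    data untouched and rescale each sample's loss instead."""
    n, n_pos = len(y), sum(y)
    return {1: n / (2 * n_pos), 0: n / (2 * (n - n_pos))}

X2, y2 = random_oversample([1, 2, 3, 4, 5], [1, 0, 0, 0, 0])
print(len(X2), sum(y2), class_weights([1, 0, 0, 0, 0]))
```

ROS changes the sample, RW changes the loss; the study's finding is that either helps high-capacity models, whereas RUS (discarding majority data) and SMOTE (interpolating synthetic minority points) consistently hurt.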

[LG-71] bde: A Python Package for Bayesian Deep Ensembles via MILE

链接: https://arxiv.org/abs/2605.14146
作者: Vyron Arvanitis,Angelos Aslanidis,Emanuel Sommer,David Rügamer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:bde is a user-friendly Python package for Bayesian Deep Ensembles with a particular focus on tabular data. Built on an efficient JAX implementation of the sampling-based inference method Microcanonical Langevin Ensembles (MILE), it provides scikit-learn compatible estimators for fast training, efficient Markov Chain Monte Carlo sampling, and uncertainty quantification in both regression and classification tasks.

[LG-72] Synthetic Sociality: How Generative Models Privatize the Social Fabric

链接: https://arxiv.org/abs/2605.14090
作者: Ana Dodik,Moira Weigel
类目: Computers and Society (cs.CY); Graphics (cs.GR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We put forth a critical theoretical framework for analyzing generative models both descriptively and normatively. Our thesis is that generative models automate the production not only of intellectual labor or intelligence, but of a broader set of human social capacities we name “social doing.” We do this by historicizing the commodification of sociality in the digital economy, leading to the availability of social data as the precondition for generative models. We elaborate our definition of “social doing” by drawing a distinction between “use” and “exchange” sociality and further differentiate between the ways that generative models either substitute for or mediate existing social relations and processes. We then turn to existing empirical research on how people use generative model-based products and the effects that their use has upon them. In this, we introduce the concept of Synthetic Sociality, a social reality in part fabricated by Silicon Valley’s privately owned and undemocratically governed generative models. Lastly, we offer a normative analysis based on our findings and framework, and discuss future design opportunities.

[LG-73] Fair and Calibrated Toxicity Detection with Robust Training and Abstention

链接: https://arxiv.org/abs/2605.14074
作者: Mokshit Surana
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Fairness in toxicity classification involves three integrated axes: ranking, calibration, and abstention. Training-time interventions and post-hoc safety mechanisms cannot be evaluated independently because the former determines the efficacy of the latter. We compare Empirical Risk Minimization (ERM), instance-level reweighting, and Group DRO across these axes, combined with temperature scaling, confidence-based abstention, and per-identity threshold optimization. Evaluation uses subgroup AUC, BPSN/BNSP AUC, error gaps, and per-subgroup Expected Calibration Error (ECE) with bootstrap CIs ( n = 1000 ). We report four findings. (1) Calibration disparity is a hidden fairness violation. ERM has near-perfect aggregate calibration ( 0.013 ) but is significantly miscalibrated across all identity subgroups ( +0.029 to +0.134 ). (2) Training interventions reshape rather than eliminate disparity. Reweighted ERM improves ranking (BPSN AUC +0.06 to +0.12 ) but worsens the calibration-fairness gap by up to +0.232 . Group DRO eliminates calibration disparity but only by becoming uniformly miscalibrated globally (ECE 0.118 ). (3) Post-hoc methods inherit training failure modes. Temperature scaling fails because miscalibration is non-uniform. Confidence-based abstention works under ERM but breaks under DRO, where the risk-coverage curve rises with deferral. (4) Abstention itself is unfair. Confidence-based deferral helps background content far more than identity-mentioning content. We argue that SRAI fairness requires a multi-axis framework: methods that differ only in aggregate ranking can differ sharply in failure modes that determine real-world harm.
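
The per-subgroup ECE metric the paper reports is the standard binned calibration error restricted to a subgroup's examples. A minimal sketch (equal-width binning is the common formulation; the bin count is an assumption):

```python
def ece(probs, labels, n_bins=10):
    """Expected Calibration Error: the coverage-weighted average gap
    between mean confidence and empirical accuracy in equal-width bins.
    Computing it on only the examples mentioning one identity group
    gives that group's per-subgroup ECE."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total, gap = len(probs), 0.0
    for b in bins:
        if b:
            conf = sum(p for p, _ in b) / len(b)
            acc = sum(y for _, y in b) / len(b)
            gap += (len(b) / total) * abs(conf - acc)
    return gap

# Predictions at 0.95 confidence that are always correct still carry a
# 0.05 calibration gap; perfectly matched confidences score zero.
print(ece([0.95, 0.95], [1, 1]), ece([1.0, 0.0], [1, 0]))
```

The paper's finding (1) is exactly that this quantity can be near zero in aggregate while being large on every identity subgroup.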

[LG-74] SurF: A Generative Model for Multivariate Irregular Time Series Forecasting

链接: https://arxiv.org/abs/2605.14069
作者: Mohammad R. Rezaei,Tejas Balaji,Rahul G. Krishnan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Irregularly sampled multivariate event streams remain a stubbornly difficult modality for generative modeling: tokenization-based approaches break down when inter-event intervals vary by orders of magnitude, and neural temporal point processes are bottlenecked by window-level numerical quadrature. We (i) propose SurF, a generative model that uses the Time Rescaling Theorem (TRT) as a learnable bijection between event sequences and i.i.d.\ unit-rate exponential noise, enabling a single model to be trained across heterogeneous event-stream datasets; (ii) three efficient parameterizations of the cumulative intensity that scale to long sequences; and (iii) a Transformer-based encoder for multi-dataset pretraining. On six real-world benchmarks, SurF achieves the best reported time RMSE on Earthquake, Retweet, and Taobao, and is within trial-level noise of the strongest specialist on the remaining three. Under a strict leave-one-out protocol, the held-out checkpoint beats every classical and neural-autoregressive baseline on 5/6 datasets and beats every baseline on Amazon and Earthquake, an initial step toward foundation models over asynchronous event streams.
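
The Time Rescaling Theorem that SurF builds on states that, under the true cumulative intensity Λ, the increments Λ(t_i) − Λ(t_{i−1}) of an event sequence are i.i.d. unit-rate exponential. A sketch of the forward transform with a known toy intensity (SurF instead learns a parameterized Λ as a bijection):

```python
def rescale(times, Lambda):
    """Time Rescaling Theorem sketch: map event times through the
    cumulative intensity Lambda; under the true Lambda the resulting
    increments are i.i.d. unit-rate exponential draws."""
    out, prev = [], 0.0
    for t in times:
        out.append(Lambda(t) - Lambda(prev))
        prev = t
    return out

# A homogeneous rate-2 process has Lambda(t) = 2t, so each inter-event
# gap is simply stretched by the rate.
print(rescale([0.5, 1.0, 2.0], lambda t: 2 * t))  # → [1.0, 1.0, 2.0]
```

Training the inverse direction of this bijection is what lets a single model generate event sequences by transforming exponential noise, and it sidesteps the window-level quadrature that bottlenecks neural temporal point processes.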

[LG-75] Comparative Evaluation of Machine Learning Approaches for Minority-Class Financial Distress Prediction Under Class Imbalance Constraints

链接: https://arxiv.org/abs/2605.14067
作者: Karan Sehgal,Khawar Naveed Bhatti
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures, preprint under review. Applied machine learning evaluation involving imbalance-aware financial distress prediction, ensemble learning, SMOTE, and SHAP explainability analysis

点击查看摘要

Abstract:Financial distress prediction remains a significant challenge in enterprise risk analysis due to the highly imbalanced nature of real-world financial datasets, where bankrupt or distressed firms typically constitute only a small minority of observations. This paper presents a comparative evaluation of classical statistical methods, ensemble learning approaches, and exploratory neural models for minority-class financial distress prediction under class imbalance constraints. The study incorporates structured preprocessing, imbalance mitigation using the Synthetic Minority Oversampling Technique (SMOTE), comparative evaluation across ensemble learning architectures including XGBoost, CatBoost, LightGBM, Random Forest, and explainability analysis using SHAP-based feature attribution methods. Experimental evaluation demonstrates that gradient-boosting approaches achieved improved minority-class sensitivity relative to baseline statistical classifiers under severe imbalance conditions. The workflow additionally emphasises reproducibility, interpretability, auditability, and governance-oriented machine learning evaluation within enterprise financial risk environments. The work is positioned as an applied engineering evaluation intended to support reproducible and interpretable machine learning workflows for financial distress prediction under severe class imbalance constraints.

[LG-76] Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

链接: https://arxiv.org/abs/2605.14063
作者: Vikash Singh,Debargha Ganguly,Weicong Chen,Sabyasachi Sahoo,Sreehari Sankar,Biyao Zhang,Mohsen Harir,Shouren Wang,Osama Zafar,Christian Gagné,Vipin Chaudhary
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately 1.3% top-1 accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode blind anchoring and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source’s normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID’s base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on 8 of 9 matched-split continual-corruption cells and is the best reset-based method on all 9, improving ROID+ASR by 1.05 pp on ResNet-50 and 0.48 pp on ViT-B/16. A controlled source-degradation sweep shows a 1.13x shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.
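The entropy-gating idea can be sketched in a few lines of NumPy. This is an illustrative reconstruction of the mechanism described in the abstract, not the paper's code; the gate exponent `power` and the functional form of the gate are assumed for the sketch:

```python
import numpy as np

def normalized_entropy(probs, eps=1e-12):
    """Predictive entropy normalized to [0, 1] by log(num_classes)."""
    probs = np.clip(probs, eps, 1.0)
    h = -np.sum(probs * np.log(probs), axis=-1)
    return h / np.log(probs.shape[-1])

def gated_anchor_weight(source_probs, base_weight=1.0, power=4.0):
    """Attenuate the source-anchor strength as the frozen source's
    posterior approaches uniformity (gate -> 0)."""
    gate = (1.0 - normalized_entropy(source_probs)) ** power
    return base_weight * gate

# Confident source posterior: the gate stays (mostly) open.
confident = np.array([[0.97, 0.01, 0.01, 0.01]])
# Collapsed, near-uniform posterior: the gate closes.
collapsed = np.array([[0.26, 0.25, 0.25, 0.24]])
print(gated_anchor_weight(confident))  # well above zero
print(gated_anchor_weight(collapsed))  # vanishingly small
```

When the gate closes, a source-anchored objective would fall back to its source-agnostic terms, matching the behavior the abstract describes.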

[LG-77] Support Before Frequency in Discrete Diffusion

链接: https://arxiv.org/abs/2605.13999
作者: Adrian Müller,Antoine Gonon,Zebang Shen,Ya-Ping Hsieh,Niao He
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Discrete diffusion models are increasingly competitive for language modeling, yet it remains unclear how their denoising objectives organize learning. Although these objectives target the full data distribution, we show that the exact reverse process induces a hierarchy between coarse support information and finer frequency information. For uniform and absorbing (a.k.a. masking) diffusion, we prove that, in the small-noise regime of the final denoising steps, each single-token reverse edit decomposes into a leading scale, determined by whether it moves toward the data support (e.g., grammatically valid sentences), and a finer coefficient, determining relative probabilities within the same scale. Thus, recovering validity structure only requires learning the correct order of magnitude of reverse probabilities, whereas recovering data frequencies requires coefficient-level estimation. The separation is mechanism-dependent: uniform diffusion exhibits a trichotomy into validity-improving, validity-preserving, and validity-worsening edits, while absorbing diffusion places its leading-order mass on validity-improving moves. Experiments on a masked language diffusion model and synthetic regular-language tasks support these predictions: support-localization emerges earlier than within-support frequency ranking, and the contrast between uniform and absorbing diffusion matches the predicted rate separation. Together, our results suggest that discrete diffusion models learn data support before data frequencies.
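For reference, the absorbing (masking) forward process discussed above can be written in a few lines; the `mask_id` value and the per-token corruption probability are illustrative choices, not taken from the paper:

```python
import numpy as np

def mask_forward(tokens, t, mask_id, rng):
    """Absorbing-state forward process: independently replace each token
    by the absorbing MASK symbol with probability t (the noise level)."""
    corrupt = rng.random(tokens.shape) < t
    out = tokens.copy()
    out[corrupt] = mask_id
    return out

rng = np.random.default_rng(0)
tokens = np.arange(10)
clean = mask_forward(tokens, 0.0, mask_id=-1, rng=rng)  # no noise: unchanged
fully = mask_forward(tokens, 1.0, mask_id=-1, rng=rng)  # full noise: all MASK
```

In the small-noise regime analyzed in the paper, the reverse process only needs to undo a single such corruption per step, which is where the validity/frequency scale separation appears.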

[LG-78] Neural Fields for NV-Center Inverse Sensing

链接: https://arxiv.org/abs/2605.13988
作者: Zhixuan Zhao,Tao Zhong,Yixun Hu,Nathalie P. de Leon,Christine Allen-Blanchette
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 33 pages, 16 figures

点击查看摘要

Abstract:Inverse problems in scientific sensing are often solved with either hand-designed regularizers or supervised networks trained on simulated labels, yet both can fail when the forward model is nonlinear, spectrally coupled, and physically delicate. We study this issue for noise sensing based on nitrogen-vacancy (NV) centers in diamond, where a quantum sensor measures magnetic-noise spectra generated by sparse spin sources. We show that replacing a common scalar/coherent forward approximation with a tensor power-summed dipolar operator changes the inverse landscape and exposes a center-collapse failure mode in free-density optimization. We propose NeTMY, an amortization-free coordinate neural field coupled to the differentiable NV forward model, with annealed positional encoding, multiscale optimization, sparsity/gating, and spectrum-fidelity losses. Across sparse synthetic reconstructions generated by the corrected operator, NeTMY achieves the best localization and distributional metrics in the tested benchmark. Mechanism experiments show that NeTMY does not directly execute the raw density-space gradient; its parameterization smooths and redistributes updates, mitigating the center-collapse pathology. These results position NV quantum sensing as a useful testbed for physics-faithful neural inverse problems.

[LG-79] TabPFN-3: Technical Report

链接: https://arxiv.org/abs/2605.13986
作者: Léo Grinsztajn,Klemens Flöge,Oscar Key,Felix Birkel,Philipp Jund,Brendan Roof,Mihir Manium,Shi Bin(Liam)Hoo,Magnus Bühler,Anurag Garg,Dominik Safaric,Jake Robertson,Benjamin Jäger,Simone Alessi,Adrian Hayler,Vladyslav Moroshan,Lennart Purucker,Philipp Singer,Alan Arazi,Julien Siems,Jan Hendrik Metzen,Georg Grab,Nick Erickson,Siyuan Guo,Eliott Kalfon,Simon Bing,David Salinas,Clara Cornu,Lilly Charlotte Wehrhahn,Diana Kriuchkova,Kursat Kaya,Lydia Sidhoum,Marie Salmon,Jerry Chen,Madelon Hulsebos,Yann LeCun,Samuel Müller,Bernhard Schölkopf,Sauraj Gambhir,Noah Hollmann,Frank Hutter
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:Tabular data underpins most high-value prediction problems in science and industry, and TabPFN has driven the foundation model revolution for this modality. Designed with feedback from our users, TabPFN-3 builds on this foundation to scale state-of-the-art performance to datasets with 1M training rows and substantially reduce training and inference time. Pretrained exclusively on synthetic data from our prior, TabPFN-3 dramatically pushes the frontier of tabular prediction and brings substantial gains on time series, relational, and tabular-text data. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and pareto-dominates the speed/performance frontier. On more diverse datasets, TabPFN-3 ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features. TabPFN-3 introduces test-time compute scaling to tabular foundation models. Our API offering TabPFN-3-Plus (Thinking) exploits this to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being 10x faster, without using LLMs, real data, internet search or any other model besides TabPFN. TabPFN-3 extends the capabilities of our models, enabling SOTA prediction on relational data (new SOTA foundation model on RelBenchV1) and tabular-text data (SOTA on TabSTAR via TabPFN-3-Plus); and improves existing integrations: a specialized checkpoint, TabPFN-TS-3, ranks 2nd on the time-series benchmark fev-bench, and SHAP-value computation is up to 120x faster. TabPFN-3 achieves this performance while being up to 20x faster than TabPFN-2.5. In addition, a reduced KV cache and row-chunking scale to 1M rows on one H100 with fast inference speed.

[LG-80] A Unified Geometric Framework for Weighted Contrastive Learning

链接: https://arxiv.org/abs/2605.13943
作者: Raphael Vock,Edouard Duchesnay,Benoit Dufumier
类目: Machine Learning (cs.LG)
*备注: Preprint

点击查看摘要

Abstract:Contrastive learning (CL) aims to preserve relational structure between samples by learning representations that reflect a similarity graph. Yet, the geometry of the resulting embeddings remains poorly understood. Here we show that weighted InfoNCE objectives can be interpreted as Distance Geometry Problems, where the weighting scheme specifies the target geometry to be realized by the representation. This viewpoint yields exact characterizations of the optimal embeddings for several supervised and weakly supervised objectives. In supervised classification, both SupCon and Soft SupCon (a dense relaxation of it where pairs from distinct classes have small non-zero similarity) collapse samples within each class to a single prototype. However, while balanced SupCon recovers the classical regular simplex geometry, class imbalance breaks this symmetry: SupCon induces non-uniform inter-class similarities depending on class sizes, whereas Soft SupCon preserves a regular simplex geometry regardless of class imbalance. In continuous-label settings, our framework reveals a different failure mode: y-Aware CL generally cannot attain its entropic optimum unless the labels lie on a hypersphere, exposing a mismatch between Euclidean label weights and spherical latent similarity. By contrast, geometrically consistent choices such as Euclidean-Euclidean weighting or X-CLR admit unique optimal embeddings. Our results show that the choice of weighting scheme determines whether contrastive learning is geometrically realizable, degenerate, or inconsistent, providing a principled framework for designing contrastive objectives.
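A generic weighted InfoNCE of the kind analyzed above can be sketched as a soft cross-entropy between a row-normalized weight graph (the target geometry) and the softmax over embedding similarities; SupCon corresponds to binary same-class weights, and Soft SupCon to additionally giving off-class pairs a small non-zero weight. This is an illustrative NumPy sketch, not the authors' implementation:

```python
import numpy as np

def weighted_infonce(z, w, tau=0.5):
    """Weighted InfoNCE: soft cross-entropy between the row-normalized
    weight graph w and the softmax over cosine similarities of the
    L2-normalized embeddings z."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)                  # exclude self-pairs
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    logp = np.where(np.isfinite(logp), logp, 0.0)   # mask the diagonal
    p_target = w / w.sum(axis=1, keepdims=True)
    return -np.mean(np.sum(p_target * logp, axis=1))

# SupCon weights: 1 for same-class pairs, 0 otherwise (diagonal excluded).
labels = np.array([0, 0, 1, 1])
w_sup = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(w_sup, 0.0)
z = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_aligned = weighted_infonce(z, w_sup)
```

Under the paper's framing, the choice of `w` specifies the target geometry the embedding is asked to realize, which is why the loss above is lower when embeddings cluster according to the weight graph.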

[LG-81] EMA: Efficient Model Adaptation for Learning-based Systems

链接: https://arxiv.org/abs/2605.13942
作者: Daiyang Yu,Xinyu Chen,Yihan Zhang,Yan Liang,Yaqi Qiao,Fan Lai
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Networking and Internet Architecture (cs.NI)
*备注: SIGCOMM (2026)

点击查看摘要

Abstract:Machine learning (ML) is increasingly applied to optimize system performance in tasks such as resource management and network simulation. Unlike traditional ML tasks (e.g., image classification), networked systems often operate in heterogeneous, long-running, and dynamic environment states, where input conditions (e.g., network loads) and operational objectives can shift over time and across settings. Existing learning-based systems offer little support for adaptation, resulting in costly model training, extensive data collection, degraded system performance, and slow responsiveness. This paper presents EMA, the first model adaptation system supporting learning-based systems to adapt to evolving environments with minimal operational overhead. EMA takes a system-driven, data-centric approach that accommodates diverse system and model designs while addressing two key deployment challenges. First, it reduces expensive model training by introducing state transformers that align the input state of a new environment with previously similar states, allowing models to warm-start adaptation. Second, it addresses the often-overlooked yet costly process of data labeling–collecting ground truth for exploring and training on various system decisions–by prioritizing labeling high-utility data while balancing the tradeoff between training and labeling cost. Evaluations on eight representative learning-based systems show that EMA reduces adaptation costs (e.g., GPU training time) by 14.9-42.4% while improving system performance (e.g., network throughput) by 6.9-31.3%.

[LG-82] Rethinking Molecular OOD Generalization via Target-Aware Source Selection

链接: https://arxiv.org/abs/2605.13932
作者: Zhuohao Lin,Kun Li,Jiameng Chen,Jiajun Yu,Duanhua Cao,Yizhen Zheng,Wenbin Hu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Robust prediction of molecular properties under extreme out-of-distribution (OOD) scenarios is a pivotal bottleneck in AI-driven drug discovery. Current scaffold-splitting protocols fail to obstruct microscopic semantic overlap, predisposing models to shortcut learning and overestimating their true extrapolation capability; meanwhile, conventional domain adaptation paradigms suffer under extreme structural shifts, as blindly aligning heterogeneous source libraries injects topological noise and triggers negative transfer. To address these two challenges, scaffold-cluster out-of-distribution performance evaluation benchmark (SCOPE-BENCH), a benchmark built on cluster-level partitioning in an explicit physicochemical descriptor space, is proposed alongside policy optimization for multi-source adaptation (POMA), a framework that formulates knowledge transfer as a retrieve-compose-adapt pipeline: labeled source scaffolds structurally close to the unlabeled target are first identified as proxy targets; a reinforcement learning policy then adaptively selects the optimal source subset from an exponentially large candidate pool; and dual-scale domain adaptation is finally performed at macroscopic topological and microscopic pharmacophore scales. Evaluations show that prediction errors of state-of-the-art 3D molecular models surge by up to 8.0x on SCOPE-BENCH with a mean of 5.9x, while POMA achieves up to an 11.2% reduction in mean absolute error with an average relative improvement of 6.2% across diverse backbone architectures. Code is available at this https URL.

[LG-83] XAI and Statistical Analysis for Reliable Intrusion Detection in the UAVIDS-2025 Dataset: From Tree to Hybrid and Tabular DNN Ensembles

链接: https://arxiv.org/abs/2605.13922
作者: Iakovos-Christos Zarkadis,Christos Douligeris
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:In recent years, Mechanistic Interpretability, a specific area under the umbrella of explainable artificial intelligence (XAI), has been introduced to explain the decisions made by complex machine learning (ML) models in critical systems such as UAV intrusion detection systems (UAVIDS). In this paper, we apply best practices for data pre-processing and examine a wide range of tree ensembles, deep neural networks, hybrid stacking models, and the latest ensemble neural networks to detect intrusions in UAVs, with stratified 10-fold cross-validation. With our top-performing model, XGBoost, we proceed to SHapley Additive exPlanations (SHAP) to analyze the global and local feature importances and understand which features each attack targets to mimic normal traffic, and where the misclassifications occur. Furthermore, a distribution analysis follows, visually comparing violin plots and kernel density estimation (KDE) curves. Using the Westfall-Young permutation test for multiple comparisons, bandwidth optimization of the KDEs, and the Jensen-Shannon distance as the test statistic, we uncover the true causes of the false predictions observed in Wormhole and Blackhole attacks in UAVIDS-2025. The findings provide robust, reliable, and explainable models for UAV intrusion detection, along with statistical insights that capture and clarify the masked nature of these attacks, regarding the challenge of Density Support Intersection between them in this dataset.
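The core numerical step of the distribution analysis, comparing kernel density estimates of a feature under two traffic classes via the Jensen-Shannon distance, can be sketched as follows. The grid, bandwidth, and Gaussian kernel here are illustrative choices, not the paper's optimized settings:

```python
import numpy as np

def kde(samples, grid, bw):
    """Gaussian kernel density estimate evaluated on a fixed grid."""
    d = (grid[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * d**2).sum(axis=1) / (len(samples) * bw * np.sqrt(2 * np.pi))

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance: sqrt of the JS divergence (base-2 logs),
    bounded in [0, 1]; densities are renormalized on the shared grid."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

rng = np.random.default_rng(0)
grid = np.linspace(-6.0, 14.0, 400)
normal_traffic = rng.normal(0.0, 1.0, 500)   # toy stand-in for one class
attack_traffic = rng.normal(8.0, 1.0, 500)   # toy stand-in for the other
d = js_distance(kde(normal_traffic, grid, 0.4), kde(attack_traffic, grid, 0.4))
```

A distance near 0 indicates heavily overlapping densities (the "density support intersection" the paper points to as a cause of misclassification), while a distance near 1 indicates well-separated classes.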

[LG-84] CA2: Code-Aware Agent for Automated Game Testing

链接: https://arxiv.org/abs/2605.13918
作者: Valliappan Chidambaram Adaikkappan,Vincent Martineau,Joshua Romoff,David Meger
类目: Software Engineering (cs.SE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Automated game testing is important for verifying game functionality, but it remains a costly and time-consuming process. Manual testing often misses edge cases, and current automated methods struggle to provide full code coverage. Prior work has explored reinforcement learning (RL) for game testing, but without leveraging internal code signals such as the call stack. We present Code Aware Agent (CA2), which uses call stack information to learn effective testing strategies. The agent receives the current function call trace along with the game state and learns to reach specific target functions. We instrument two types of environments, 1) State-based and 2) Image-based, with support for efficient call stack extraction. Through experimental evaluation, we find that CA2 achieves consistent improvement over the non-code-aware baselines, which do not leverage call stack information. Our results show that incorporating code signals like the call stack enables more effective and targeted game testing.

[LG-85] Indian Wedding System Optimization (IWSO): A Novel Socially Inspired Metaheuristic with Operational Design and Analysis

链接: https://arxiv.org/abs/2605.13871
作者: Deepika Saxena,Kishu Gupta,Jitendra Kumar,Jatinder Kumar,Sakshi Patni,Vinaytosh Mishra,Niharika Singh,Ashutosh Kumar Singh
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper presents a novel population-based metaheuristic, Indian Wedding System Optimization (IWSO), inspired by the socio-cultural dynamics of traditional Indian weddings. IWSO models the matchmaking process driven by collaboration among families, candidates, and matchmakers as a guided, selective search framework for solving complex optimization problems. The algorithm introduces two key innovations: (i) a matchmaker-guided influence strategy, where elite solutions direct the evolution of weaker candidates, enhancing convergence without external parameters; and (ii) an adaptive elimination and reinitialization mechanism that maintains diversity and prevents premature convergence by replacing underperforming individuals. IWSO employs a weighted multi-objective fitness function and analytically derived time and space complexity, benchmarked against existing optimization approaches such as Genetic Algorithm (GA), Particle Swarm Optimization (PSO), Differential Evolution (DE), Cuckoo Search (CS), etc. Extensive experiments on benchmark high-dimensional and multimodal test functions demonstrate superior performance of IWSO in terms of convergence speed, solution quality, and robustness.

[LG-86] Neuromorphic Graph Anomaly Detection via Adaptive STDP and Spiking Graph Neural Networks

链接: https://arxiv.org/abs/2605.13863
作者: Abdul Joseph Fofanah,Lian Wen,David Chen,Tsungcheng Yao,Kwabena Sarpong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Anomaly detection in dynamic networks is critical for applications from cybersecurity to industrial monitoring, yet existing methods face challenges in energy efficiency, temporal precision, and adaptability. This paper introduces ASTDP-GAD, a novel Adaptive Spiking Temporal Dynamics Plasticity framework for Graph Anomaly Detection that integrates spiking graph neural networks with STDP learning for energy-efficient neuromorphic detection in dynamic networks. Our framework unifies spiking neural computation, STDP learning, and graph-based anomaly detection through the following key innovations: temporal spike graph encoding with adaptive Leaky Integrate-and-Fire (LIF) dynamics; LIF-based graph attention with lateral inhibition; event-driven hypergraph memory with STDP-inspired prototype updates; spike rate contrast pooling based on spiking irregularity; adaptive STDP layers capturing causal temporal relationships; and multi-scale temporal convolution with multi-factor anomaly fusion. Theoretical analysis provides rigorous guarantees: spike encoding preserves input information with resolution scaling linearly in simulation steps and hidden dimension; LIFGAT approximates any continuous attention function; hypergraph memory converges to optimal prototypes; contrast pooling achieves provable anomaly selection bounds; STDP learning converges stably; and multi-factor fusion produces calibrated scores with up to 5\times variance reduction. Extensive experiments on nine datasets on both dynamic and static graphs demonstrate superior anomaly detection accuracy while maintaining biological plausibility and energy efficiency for neuromorphic deployment.
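For reference, the Leaky Integrate-and-Fire (LIF) dynamics underlying the framework can be sketched in discrete time; the constants below are illustrative, not the paper's adaptive parameterization:

```python
import numpy as np

def lif_simulate(input_current, tau=10.0, v_thresh=1.0, v_reset=0.0, dt=1.0):
    """Leaky Integrate-and-Fire neuron: the membrane potential leaks toward
    zero, integrates the input current, and emits a spike followed by a
    reset whenever it crosses the threshold."""
    v, spikes, trace = v_reset, [], []
    for i_t in input_current:
        v = v + dt * (-v / tau + i_t)
        if v >= v_thresh:
            spikes.append(1)
            v = v_reset
        else:
            spikes.append(0)
        trace.append(v)
    return np.array(spikes), np.array(trace)

# Suprathreshold drive (steady state tau*i = 2.0 > threshold): periodic spikes.
spikes_hi, trace_hi = lif_simulate(np.full(50, 0.2))
# Subthreshold drive (steady state tau*i = 0.5 < threshold): no spikes.
spikes_lo, trace_lo = lif_simulate(np.full(50, 0.05))
```

Spiking graph layers replace the scalar current here with aggregated, attention-weighted messages from graph neighbors, but the per-neuron update is of this form.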

[LG-87] he Geometry of LLM Quantization: GPT Q as Babais Nearest Plane Algorithm ICLR2026

链接: https://arxiv.org/abs/2507.18553
作者: Jiale Chen,Yalda Shabanzadeh,Elvir Crnčević,Torsten Hoefler,Dan Alistarh
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Information Theory (cs.IT)
*备注: Published as a conference paper at the Fourteenth International Conference on Learning Representations (ICLR 2026): this https URL

点击查看摘要

Abstract:Quantizing the weights of large language models (LLMs) from 16-bit to lower bitwidth is the de facto approach to deploy massive transformers onto more affordable accelerators. While GPTQ emerged as one of the standard methods for one-shot post-training quantization at LLM scale, its inner workings are described as a sequence of algebraic updates that obscure geometric meaning or worst-case guarantees. In this work, we show that, when executed back-to-front (from the last to first dimension) for a linear layer, GPTQ is mathematically identical to Babai’s nearest plane algorithm for the classical closest vector problem (CVP) on a lattice defined by the Hessian matrix of the layer’s inputs. This equivalence is based on a sophisticated mathematical argument, and has two analytical consequences: first, the GPTQ error propagation step gains an intuitive geometric interpretation; second, GPTQ inherits the error upper bound of Babai’s algorithm under the assumption that no weights are clipped. Leveraging this bound, we design post-training quantization methods that avoid clipping, and outperform the original GPTQ. In addition, we provide efficient GPU inference kernels for the resulting representation. Taken together, these results place GPTQ on a firm theoretical footing and open the door to importing decades of progress in lattice algorithms towards the design of future quantization algorithms for billion-parameter models. Source code is available at this https URL.
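Babai's nearest-plane algorithm, to which back-to-front GPTQ is shown to be equivalent, can be sketched directly from its definition: Gram-Schmidt-orthogonalize the lattice basis, then round off one coefficient at a time from the last basis vector to the first. This is a textbook sketch for intuition, not the paper's GPU implementation:

```python
import numpy as np

def gram_schmidt(B):
    """Gram-Schmidt orthogonalization of the rows of basis B."""
    B_star = B.astype(float).copy()
    for i in range(len(B)):
        for j in range(i):
            mu = B_star[j] @ B[i] / (B_star[j] @ B_star[j])
            B_star[i] = B_star[i] - mu * B_star[j]
    return B_star

def babai_nearest_plane(B, t):
    """Babai's nearest-plane: approximate closest vector to t in the
    lattice spanned by the rows of B, processed back to front (mirroring
    GPTQ's last-to-first dimension order)."""
    B_star = gram_schmidt(B)
    b = t.astype(float).copy()
    coeffs = np.zeros(len(B))
    for i in reversed(range(len(B))):
        c = round(b @ B_star[i] / (B_star[i] @ B_star[i]))
        coeffs[i] = c
        b = b - c * B[i]
    return coeffs @ B, coeffs

# On the integer lattice Z^2 this reduces to coordinate-wise rounding.
v, c = babai_nearest_plane(np.eye(2), np.array([0.4, 1.7]))
```

In the GPTQ correspondence, the lattice basis is derived from the Hessian of the layer's inputs and the target vector from the unquantized weights, so the residual subtraction above plays the role of GPTQ's error-propagation step.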

[LG-88] Scalable Mechanistic Neural Networks for Differential Equations and Machine Learning ICLR2025

链接: https://arxiv.org/abs/2410.06074
作者: Jiale Chen,Dingling Yao,Adeel Pervez,Dan Alistarh,Francesco Locatello
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注: Published as a conference paper at the Thirteenth International Conference on Learning Representations (ICLR 2025): this https URL

点击查看摘要

Abstract:We propose Scalable Mechanistic Neural Network (S-MNN), an enhanced neural network framework designed for scientific machine learning applications involving long temporal sequences. By reformulating the original Mechanistic Neural Network (MNN) (Pervez et al., 2024), we reduce the computational time and space complexities from cubic and quadratic with respect to the sequence length, respectively, to linear. This significant improvement enables efficient modeling of long-term dynamics without sacrificing accuracy or interpretability. Extensive experiments demonstrate that S-MNN matches the original MNN in precision while substantially reducing computational resources. Consequently, S-MNN can drop-in replace the original MNN in applications, providing a practical and efficient tool for integrating mechanistic bottlenecks into neural network models of complex dynamical systems. Source code is available at this https URL.

[LG-89] RoSHAP: A Distributional Framework and Robust Metric for Stable Feature Attribution

链接: https://arxiv.org/abs/2605.15154
作者: Lanxin Xiang,Liang Shi,Youhui Ye,Boyu Jiang,Dawei Zhou,Feng Guo
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Feature attribution analysis is critical for interpreting machine learning models and supporting reliable data-driven decisions. However, feature attribution measures often exhibit stochastic variation: different train–test splits, random seeds, or model-fitting procedures can produce substantially different attribution values and feature rankings. This paper proposes a framework for incorporating the stochastic nature of feature attribution and a robust attribution metric, RoSHAP, for stable feature ranking based on the SHAP metric. The proposed framework models the distribution of feature attribution scores and estimates it through bootstrap resampling and kernel density estimation. We show that, under mild regularity conditions, the aggregated feature attribution score is asymptotically Gaussian, which greatly reduces the computational cost of distribution estimation. RoSHAP summarizes the distribution of SHAP values into a robust feature-ranking criterion that simultaneously rewards features that are active, strong, and stable. Through simulations and real-data experiments, the proposed framework and RoSHAP outperform standard single-run attribution measures in identifying signal features. In addition, models built using RoSHAP-selected features achieve predictive performance comparable to full-feature models while using substantially fewer predictors. The proposed RoSHAP approach improves the stability and interpretability of machine learning models, enabling reliable and consistent insights for analysis.
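The bootstrap-then-rank pattern can be illustrated with a toy stand-in. This is not the paper's RoSHAP metric: absolute OLS coefficients substitute for SHAP values, and the strength-over-instability ratio is an assumed robust criterion in the spirit of "strong and stable":

```python
import numpy as np

def bootstrap_attributions(X, y, attribution_fn, n_boot=200, seed=0):
    """Resample the data with replacement, recompute attributions each
    time, and return the per-feature score matrix (n_boot x n_features)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        scores.append(attribution_fn(X[idx], y[idx]))
    return np.asarray(scores)

def robust_rank(scores):
    """Rank features by mean attribution magnitude penalized by its
    bootstrap standard deviation (rewards strong AND stable features)."""
    strength = np.abs(scores).mean(axis=0)
    instability = scores.std(axis=0)
    return strength / (instability + 1e-8)

# Toy attribution: OLS coefficients as a stand-in for SHAP values.
def ols_attr(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)  # only feature 0 is signal
S = bootstrap_attributions(X, y, ols_attr)
```

A single-run score for a noise feature can occasionally look large; aggregating over bootstrap replicates exposes its instability, which is the motivation for a distributional criterion.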

[LG-90] From Data to Action: Accelerating Refinery Optimization with AI

链接: https://arxiv.org/abs/2605.15085
作者: Dániel Pfeifer,Ábrahám Papp,Tibor Bernáth,Tamás Zoltán Varga,Márk Czifra,Botond Szilágyi,Edith Alice Kovács
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 34 pages, 17 figures

点击查看摘要

Abstract:Nowadays, refinery optimization utilizes vast amounts of data, which can be handled with modern Linear Programming (LP) software, but interpreting and applying the results remains challenging. Large petrochemical companies use massive models, with hundreds of thousands of input matrix elements. The LP solution is mathematically correct, but simplifications are made in the model, and data supply errors may occur. Therefore, further insight is needed to trust the results. The LP solver does not have a memory, so additional understanding can be gained by analyzing historical data and comparing it to the current plan. As such, machine learning approaches have been suggested to support decision making based on the LP solution. Among these, Anomaly Detection tools are proposed to be used in tandem with the LP output. A transformed version of the popular ECOD methodology is applied. New methods are proposed to handle high-dimensional data by choosing the most informative feature pairs. These are then used alongside two 2D Anomaly Detection algorithms, revealing several business opportunities and data supply errors in the MOL refinery scheduling and planning architecture.
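A minimal ECOD-style score of the kind the pipeline builds on (empirical tail probabilities aggregated across dimensions) might look like this; the paper's transformed variant and informative-pair selection are not reproduced here:

```python
import numpy as np

def ecod_scores(X):
    """ECOD-style outlier score: per dimension, take the smaller of the
    left and right empirical tail probabilities of each value, and sum
    the negative logs across dimensions (larger = more anomalous)."""
    n, d = X.shape
    score = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        srt = np.sort(col)
        left = np.searchsorted(srt, col, side="right") / n
        right = 1.0 - np.searchsorted(srt, col, side="left") / n
        tail = np.minimum(left, right)
        score += -np.log(np.clip(tail, 1.0 / n, None))
    return score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]            # inject an obvious outlier
scores = ecod_scores(X)
```

Because the score is built from empirical CDFs, it needs no distributional assumptions and no training, which is part of ECOD's appeal for planning-data sanity checks.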

[LG-91] Average Gradient Outer Product in kernel regression provably recovers the central subspace for multi-index models

链接: https://arxiv.org/abs/2605.15082
作者: Libin Zhu,Damek Davis,Dmitriy Drusvyatskiy,Maryam Fazel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注: 95 pages, 12 figures

点击查看摘要

Abstract:We study a prototypical situation when a learned predictor can discover useful low-dimensional structure in data, while using fewer samples than are needed for accurate prediction. Specifically, we consider the problem of recovering a multi-index polynomial f^*(x)=h(Ux) , with U\in\mathbb{R}^{r\times d} and r\ll d , from finitely many data/label pairs. Importantly, the target function depends on input x only through the projection onto an unknown r -dimensional central subspace. The algorithm we analyze is appealingly simple: fit kernel ridge regression (KRR) to the data and compute the Average Gradient Outer Product (AGOP) from the fitted predictor. Our main results show that under reasonable assumptions the top r -dimensional eigenspace of AGOP provably recovers the central subspace, even in regimes when the prediction error remains large. Specifically, if the target function f^* has degree p^* , it is known that n\asymp d^{p^*} samples are necessary for KRR to achieve accurate prediction. In contrast, we show that if a low degree p component of f^* already carries all relevant directions for prediction, subspace recovery occurs in the much lower sample regime n\asymp d^{p+\delta} for any \delta\in(0,1) . Our results thus demonstrate a separation between prediction and representation, and provide an explanation for why iterative kernel methods such as Recursive Feature Machines (RFM) can be sample-efficient in practice.
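The AGOP estimator itself is simple to state. The sketch below uses finite-difference gradients on a known single-index target to show that the top eigenvector of the AGOP recovers the index direction; it illustrates the recovery claim on an idealized predictor, not the KRR pipeline from the paper:

```python
import numpy as np

def agop(predict, X, h=1e-4):
    """Average Gradient Outer Product E[grad f(x) grad f(x)^T], with the
    gradient of the fitted predictor estimated by central differences."""
    n, d = X.shape
    M = np.zeros((d, d))
    for x in X:
        g = np.zeros(d)
        for j in range(d):
            e = np.zeros(d)
            e[j] = h
            g[j] = (predict(x + e) - predict(x - e)) / (2 * h)
        M += np.outer(g, g)
    return M / n

# Single-index target f(x) = (u.x)^2: the AGOP is rank one along u, so
# its top eigenvector aligns with the central subspace direction.
rng = np.random.default_rng(0)
d = 5
u = np.zeros(d)
u[0] = 1.0
f = lambda x: (u @ x) ** 2
X = rng.normal(size=(200, d))
M = agop(f, X)
top = np.linalg.eigh(M)[1][:, -1]   # eigenvector of the largest eigenvalue
```

In the paper's setting, `predict` would be the KRR fit rather than the true target, and the result is that the top eigenspace still recovers the subspace well before the fit itself is accurate.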

[LG-92] Multi-Block Attention for Efficient Channel Estimation in IRS-Assisted mmWave MIMO

链接: https://arxiv.org/abs/2605.15032
作者: Mehrdad Momen-Tayefeh,Mehrshad Momen-Tayefeh,Maryam Sabbaghian
类目: Signal Processing (eess.SP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Intelligent Reflecting Surfaces (IRSs) are a promising technology for enhancing the spectral and energy efficiency of millimeter-wave (mmWave) multiple-input multiple-output (MIMO) systems. In these systems, accurate channel estimation remains challenging due to the passive nature of IRS elements and the high pilot overhead in large-scale deployments. This paper presents a deep learning-based Multi-Block Attention (MBA) framework for efficient cascaded channel estimation in IRS-assisted mmWave MIMO systems that utilize orthogonal frequency division multiplexing (OFDM). First, we show the optimality of the discrete Fourier transform (DFT) and Hadamard matrices as phase configurations for least squares (LS) estimation. To reduce training overhead, we selectively deactivate IRS elements and compensate for induced feature loss using a two-stage architecture: (i) a Convolutional Attention Network (CAN) for spatial correlation recovery and (ii) a Complex Multi-Convolutional Network (CMN) for noise suppression. The MBA architecture mitigates error propagation through attention-guided feature refinement and denoising. Simulation results indicate that the MBA method reduces pilot overhead by up to 87% compared to the LS estimator. Additionally, at signal-to-noise ratios of 10 dB, our proposed method achieves approximately 51% lower normalized mean squared error (NMSE) than leading methods. It also maintains low computational complexity and adapts effectively to various propagation environments.
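The optimality of the DFT matrix as a phase configuration for LS estimation rests on its orthogonality (F^H F = N·I), which turns the LS inverse into a scaled conjugate transpose. A noiseless toy sketch, with dimensions and the single-channel setup chosen for illustration only:

```python
import numpy as np

N = 8                                             # number of IRS elements (toy)
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)      # DFT phase-configuration matrix

rng = np.random.default_rng(0)
h = rng.normal(size=N) + 1j * rng.normal(size=N)  # toy cascaded channel
y = F @ h                                         # noiseless pilot observations

# Because F^H F = N I, the LS estimate reduces to a matched filter:
h_ls = F.conj().T @ y / N
```

Unit-modulus entries also respect the passive-IRS constraint, which is why DFT (and Hadamard) configurations are natural candidates for the training phase.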

[LG-93] Real-time virtual circuits for plasma shape control via neural network emulators

链接: https://arxiv.org/abs/2605.14939
作者: Alasdair Ross,George K. Holt,Kamran Pentland,Adriano Agnello,Nicola C. Amorisco,Pedro Cavestany,Aran Garrod,Timothy Nunn,Charles Vincent,Graham McArdle
类目: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reliable position and shape control in tokamak plasmas requires accurate real-time regulation of several strongly coupled shape parameters. The control vectors that disentangle these couplings, referred to as \textitvirtual circuits (VCs), enable independent shape parameter control for a specific Grad–Shafranov (GS) equilibrium. Numerical calculation of VCs is not currently feasible in real time, therefore VCs are usually computed prior to each experiment, using a small number of reference GS equilibria sampled along the desired scenario trajectory, with each VC used to control the plasma within a preset time interval. While effective near the reference equilibrium, this approach can lead to degraded performance as the plasma departs from the reference equilibrium and/or from the desired trajectory, and it complicates the design of robust control strategies for rapidly evolving plasma configurations. In this paper, we construct neural-network-based emulators of plasma shape parameters from which VCs can be derived, to provide the MAST Upgrade (MAST-U) plasma control system with state-aware VCs in real-time. To do this, we develop an extensive library of over a million simulated GS equilibria, covering a substantial portion of the MAST-U operational space. These emulators provide differentiable functions whose gradients can be rapidly computed, enabling the derivation of accurate VCs for real-time shape control. We perform extensive verification of the emulated VCs by testing whether they disentangle the control problem. The neural-network-based approach delivers high accuracy and orthogonality across a diverse range of equilibria. This work establishes the physical validity of emulated VCs as a scalable and general alternative to schedules of precomputed VCs.
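A minimal sketch of the virtual-circuit idea described above, under stated assumptions: a small differentiable function stands in for the neural shape-parameter emulator, its Jacobian is taken by finite differences (a real emulator would supply it via autodiff), and the VC matrix is the Jacobian pseudo-inverse, so that applying circuit k moves (to first order) only shape parameter k.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in emulator: 5 coil currents -> 3 shape parameters.
A = rng.standard_normal((3, 5))
def shape_params(currents):
    return np.tanh(A @ currents)

def jacobian(f, x, eps=1e-6):
    # Finite-difference Jacobian of f at x.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        xp = x.copy()
        xp[j] += eps
        J[:, j] = (f(xp) - fx) / eps
    return J

x0 = np.zeros(5)
J = jacobian(shape_params, x0)
VC = np.linalg.pinv(J)   # virtual circuits: each column is a current
                         # pattern targeting one shape parameter

# Disentanglement check: the first-order response J @ VC is ~identity.
response = J @ VC
```

This mirrors the paper's "orthogonality" verification: the quality of a VC set can be read off from how close `response` is to the identity away from the linearization point.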

[LG-94] A Non-Monotone Preconditioned Trust-Region Method for Neural Network Training

链接: https://arxiv.org/abs/2605.14860
作者: Andrea Angino,Bindi Çapriqi,Shega Likaj,Ken Trotti,Rolf Krause
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures

点击查看摘要

Abstract:Training deep neural networks at scale can benefit from domain decomposition, where the network is split into subdomains trained in parallel and coupled by a global trust-region mechanism. Building on the Additively Preconditioned Trust-Region Strategy (APTS), we propose a non-monotone variant with a nonlinear additive Schwarz preconditioner that combines parallel subdomain corrections with global coarse-space directions. A windowed acceptance criterion allows controlled objective increases, avoiding needless rejection of effective coarse steps. The resulting non-monotone APTS (NAPTS) preserves accuracy while reducing CPU time by 30% and cutting rejected steps to one third of those in APTS.
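The windowed acceptance criterion described above can be sketched in a few lines: a trial objective value is accepted if it beats the worst value in a sliding window of recently accepted values, so a controlled increase (e.g. after an effective coarse step) is not rejected. This is a generic non-monotone rule in the spirit of the abstract, not the authors' exact NAPTS logic.

```python
from collections import deque

def make_window_acceptor(window=5):
    """Non-monotone acceptance: accept a trial objective value if it is
    below the maximum objective over the last `window` accepted values."""
    recent = deque(maxlen=window)
    def accept(f_trial):
        ok = not recent or f_trial < max(recent)
        if ok:
            recent.append(f_trial)
        return ok
    return accept

accept = make_window_acceptor(window=3)
# 9.5 is worse than 9.0 but still accepted (non-monotone); 12.0 exceeds
# the window maximum and is rejected.
decisions = [accept(f) for f in [10.0, 9.0, 9.5, 12.0, 8.0, 8.0]]
```

A strictly monotone rule would reject the 9.5 step; the window makes the method tolerant of small objective increases while still bounding them.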

[LG-95] K-Models: a Flexible and Interpretable Method for Ordinal Clustering with Application to Antigen-Antibody Interaction Profiles

链接: https://arxiv.org/abs/2605.14828
作者: Giulia Patanè,Alessandra Menafoglio,Alexander Krauth,Peter Fechner,Luca Dede’,Bianca Maria Colosimo,Federica Nicolussi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:

点击查看摘要

Abstract:Existing clustering methods for functional data often prioritize partitioning accuracy over interpretability, making it challenging to extract meaningful insights when the data-generating process follows a specific underlying structure and an ordinal relationship among clusters is suspected. This work introduces K-Models, a novel framework that integrates ordinal constraints and estimates key underlying elements of the random process generating the observed functional profiles, improving both interpretability and structure identification. The proposed method is evaluated through simulations and real-world applications. In particular, it is tested on Region of Interest (ROI) curves, which represent reaction profiles from a reflectometric sensor monitoring biomolecular interactions, such as antigen-antibody binding. These curves represent changes in reflected light intensity over time at multiple measurement spots with immobilized antigens during analyte exposure, capturing the binding dynamics of the system. The goal is to identify intrinsic signal patterns solely from the observed dynamics, making this dataset an ideal benchmark for assessing the added interpretability of the proposed approach. By incorporating structural assumptions into the clustering process, K-Models enhances interpretability while maintaining performance comparable to state-of-the-art techniques, providing a valuable tool for analyzing functional data with an underlying ordinal structure.

[LG-96] Scalable Solution of the Stochastic Multi-path Traveling Salesman Problem via Neural Networks

链接: https://arxiv.org/abs/2605.14662
作者: Xiaochen Chou,Ludovica Di Marco,Enza Messina
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The multi-path Traveling Salesman Problem with stochastic travel costs arises in hybrid vehicle routing applications designed for Smart City and City Logistics, where multiple paths exist between each pair of locations. Travel times along these paths are typically affected by real-time traffic conditions and therefore modeled as stochastic. The objective of the problem is to determine a Hamiltonian tour that minimizes the expected total travel cost under uncertainty. In this work, we adopt a two-stage stochastic programming formulation. In the first stage, a predefined route specifying the sequence of locations to be visited is determined, while taking into consideration a second-stage recourse problem that selects the optimal path from the feasible set of alternative paths for each pair of locations, once real-time traffic conditions are realized. To reduce the computational burden imposed by the large number of scenarios required to capture travel time uncertainty, the innovation of this work is the integration of neural network-based surrogate models to approximate the expected value of the second-stage recourse problem. Different architectures and training strategies for the neural networks are proposed and analyzed, with performance evaluated in terms of computation time, solution quality, and generalization capability. Preliminary findings demonstrate the enhanced scalability and practical applicability of the approach for complex vehicle routing problems under uncertainty.

[LG-97] All-atomistic Transferable Neural Potentials for Protein Solvation

链接: https://arxiv.org/abs/2605.14584
作者: Rishabh Dey,Salvina Sharipova,Konstantin Popov
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Implicit solvent models are widely used to decrease the number of solvent degrees of freedom and enable the calculation of solvation energetics without water molecules. However, their accuracy often falls short of explicit models. Recent advancements in neural potentials have shown promise in drug discovery, but transferability remains a persistent challenge. Here, we introduce the Protein Hydration Neural Network (PHNN), an implicit solvent model that extends analytical continuum solvation by learning transferable corrections to model parameters instead of applying post hoc adjustments to final energies. The model is explicitly designed to maximize data efficiency by leveraging physical priors embedded in the data. We demonstrate that PHNN improves accuracy relative to traditional analytical methods and maintains predictive accuracy on out-of-domain protein systems.

[LG-98] Scaling Laws from Sequential Feature Recovery: A Solvable Hierarchical Model

链接: https://arxiv.org/abs/2605.14567
作者: Arie Wortsman-Zurich,Hugo Tabanelli,Yatin Dandi,Florent Krzakala,Bruno Loureiro
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Probability (math.PR); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We propose a simple mechanism by which scaling laws emerge from feature learning in multi-layer networks. We study a high-dimensional hierarchical target that is a globally high-degree function, but that can be represented by a combination of latent compositional features whose weights decrease as a power law. We show that a layer-wise spectral algorithm adapted to this compositional structure achieves improved scaling relative to shallow, non-adaptive methods, and recovers the latent directions sequentially: strong features become detectable at small sample sizes, while weaker features require more data. We prove sharp feature-wise recovery thresholds and show that aggregating these transitions yields an explicit power-law decay of the prediction error. Technically, the analysis relies on random matrix methods and a resolvent-based perturbation argument, which gives matching upper and lower bounds for individual eigenvector recovery beyond what standard gap-based perturbation bounds provide. Numerical experiments confirm the predicted sequential recovery, finite-size smoothing of the thresholds, and separation from non-hierarchical kernel baselines. Together, these results show how smooth scaling laws can emerge from a cascade of sharp feature-learning transitions.

[LG-99] Large Dimensional Kernel Ridge Regression: Extending to Product Kernels

链接: https://arxiv.org/abs/2605.14524
作者: Yang Zhou,Yicheng Li,Yuqian Cheng,Qian Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent studies have reported \textit{saturation effects} and \textit{multiple descent} behavior in large dimensional kernel ridge regression (KRR). However, these findings are predominantly derived under restrictive settings, such as inner product kernels on the sphere or strong eigenfunction assumptions like hypercontractivity. Whether such behaviors hold for other kernels remains an open question. In this paper, we establish a broad, new family of large dimensional kernels and derive the corresponding convergence rates of the generalization error. As a result, we recover key phenomena previously associated with inner product kernels on the sphere, including: i) the \textit{minimax optimality} when the source condition s \le 1; ii) the \textit{saturation effect} when s > 1; iii) a \textit{periodic plateau} phenomenon in the convergence rate and a \textit{multiple-descent} behavior with respect to the sample size n.
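For readers unfamiliar with the setup, the following hedged sketch shows plain kernel ridge regression with a kernel written as a coordinate-wise product (here the product of per-coordinate Gaussian factors, which coincides with the usual RBF kernel); it illustrates the estimator the paper analyzes, not its large-dimensional theory.

```python
import numpy as np

rng = np.random.default_rng(0)

def product_kernel(X, Z, gamma=1.0):
    # Product of per-coordinate Gaussian kernels, mirroring the
    # product-kernel family: k(x, z) = prod_j exp(-gamma * (x_j - z_j)^2).
    K = np.ones((X.shape[0], Z.shape[0]))
    for j in range(X.shape[1]):
        d = X[:, j:j + 1] - Z[:, j:j + 1].T
        K *= np.exp(-gamma * d ** 2)
    return K

n, d, lam = 200, 3, 1e-3
X = rng.uniform(-1, 1, (n, d))
y = np.sin(X.sum(axis=1)) + 0.05 * rng.standard_normal(n)

# KRR closed form: alpha = (K + n*lam*I)^{-1} y.
K = product_kernel(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)

Xt = rng.uniform(-1, 1, (100, d))
pred = product_kernel(Xt, X) @ alpha
mse = np.mean((pred - np.sin(Xt.sum(axis=1))) ** 2)
```

The regularization scale `n * lam` matches the standard source-condition analyses; the phenomena in the abstract concern how `mse` scales as n and d grow together, which a fixed-size demo cannot show.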

[LG-100] ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing

链接: https://arxiv.org/abs/2605.14285
作者: Yixuan Jia,Siyi Chen,Yida Pan,Xiao Li,Lianghe Shi,Chanyong Jung,Haijie Yuan,Ismail Alkhouri,Yue Cynthia Wu,Saiprasad Ravishankar,Jeffrey A Fessler,Qing Qu
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data assimilation (DA) estimates the state of an evolving dynamical system from noisy, partial observations, and is widely used in scientific simulation as well as weather and climate science. In practice, filtering methods rely on frame-to-frame transition models. However, these models are fragile when observations are non-Markovian (when they form only a partial slice of a higher-dimensional latent state as in real-world weather data): they tend to accumulate errors over long horizons. At the same time, learned DA methods typically commit to a single regime, either filtering (nowcasting, real-time forecasting) or smoothing (retrospective reanalysis), which splits what should be a shared prior across application-specific pipelines. To address both issues, we introduce ForcingDAS, a unified and robust DA framework. Built on Diffusion Forcing with an independent noise level assigned to each frame, ForcingDAS learns a joint-trajectory prior instead of frame-to-frame transitions. This allows it to capture long-horizon temporal dependencies and reduce error accumulation. In addition, the same trained model spans the full filtering to smoothing spectrum at inference time. Specifically, nowcasting, fixed-lag smoothing, and batch reanalysis are selected through the inference schedule alone, without retraining. We evaluate ForcingDAS on 2D Navier-Stokes vorticity, precipitation nowcasting, and global atmospheric state estimation. Across all settings, a single model is competitive with or outperforms both learned and classical baselines that are specialized for individual regimes, with the largest gains observed on real-world weather benchmarks.

[LG-101] Training-Free Generative Sampling via Moment-Matched Score Smoothing

链接: https://arxiv.org/abs/2605.14276
作者: Zhenyu Yao,Daniel Paulin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 35 pages

点击查看摘要

Abstract:Diffusion models generate samples by denoising along the score of a perturbed target distribution. In practice, one trains a neural diffusion model, which is computationally expensive. Recent work suggests that score matching implicitly smooths the empirical score, and that this smoothing bias promotes generalization by capturing low-dimensional data geometry. We propose moment-matched score-smoothed overdamped Langevin dynamics (MM-SOLD), a training-free interacting particle sampler that enforces the target moments throughout the sampling trajectory. We prove that, in the large-particle limit, the empirical particle density converges to a deterministic limit whose one-particle stationary marginal is a Gibbs–Boltzmann density obtained by exponentially tilting a naive score-smoothed diffusion target. The mean and covariance of this distribution agree with the empirical moments of the training data. Experiments on 2D distributions and latent-space image generation show that MM-SOLD enables fast, robust, training-free sampling on CPUs, with sample fidelity and diversity competitive with neural diffusion baselines.

[LG-102] On the Burden of Achieving Fairness in Conformal Prediction

链接: https://arxiv.org/abs/2605.14260
作者: Ziang Gao,Pengqi Liu,Archer Yi Yang,Mouloud Belbahri,Jesse C. Cresswell,Masoud Asgharian
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Conformal prediction is often calibrated with a single pooled threshold, but this can hide cross-group heterogeneity in score distributions and distort group-wise coverage. We study this phenomenon through the population score distributions underlying split conformal calibration. First, we derive a conservation law and lower bound showing that pooled calibration incurs irreducible group-wise coverage distortion at a scale set by cross-group quantile heterogeneity. Second, we demonstrate that the two leading fairness definitions for conformal prediction, Equalized Coverage and Equalized Set Size, are fundamentally in tension. Third, we quantify the cost of moving between policies which treat groups separately or pool them. Experiments on synthetic and real data confirm the same bidirectional trade-off after finite-sample calibration. Our results show that, for the policy families studied here, calibration choice does not remove cross-group heterogeneity; it determines whether the resulting distortion appears in the coverage or size dimension, providing a principled lens for analyzing fairness-oriented calibration choices in practice.
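The pooled-versus-group tension described above is easy to see numerically. The hedged sketch below (synthetic scores, not the paper's data or estimator) calibrates one pooled split-conformal threshold over two groups with different score scales: the low-spread group is over-covered and the high-spread group under-covered, while per-group thresholds restore coverage at the cost of different set sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# Two groups with very different nonconformity-score scales.
s_a = np.abs(rng.normal(0, 1.0, 4000))   # group A calibration scores
s_b = np.abs(rng.normal(0, 3.0, 4000))   # group B calibration scores

def conformal_quantile(scores, alpha):
    # Standard split-conformal quantile: ceil((n+1)(1-alpha))-th order statistic.
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[k - 1]

q_pool = conformal_quantile(np.concatenate([s_a, s_b]), alpha)
q_a = conformal_quantile(s_a, alpha)
q_b = conformal_quantile(s_b, alpha)

# Fresh test scores from each group.
t_a = np.abs(rng.normal(0, 1.0, 20000))
t_b = np.abs(rng.normal(0, 3.0, 20000))
cov_a_pool = np.mean(t_a <= q_pool)   # over-covers
cov_b_pool = np.mean(t_b <= q_pool)   # under-covers
cov_a_grp = np.mean(t_a <= q_a)       # ~1 - alpha
cov_b_grp = np.mean(t_b <= q_b)       # ~1 - alpha, but larger sets
```

This is exactly the "conservation law" intuition: the pooled threshold cannot simultaneously match both group quantiles, so the distortion has to land somewhere.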

[LG-103] To discretize continually: Mean shift interacting particle systems for Bayesian inference

链接: https://arxiv.org/abs/2605.14142
作者: Ayoub Belhadji,Daniel Sharp,Youssef M. Marzouk
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:

点击查看摘要

Abstract:Integration against a probability distribution given its unnormalized density is a central task in Bayesian inference and other fields. We introduce new methods for approximating such expectations with a small set of weighted samples – i.e., a quadrature rule – constructed via an interacting particle system that minimizes maximum mean discrepancy (MMD) to the target distribution. These methods extend the classical mean shift algorithm, as well as recent algorithms for optimal quantization of empirical distributions, to the case of continuous distributions. Crucially, our approach creates dynamics for MMD minimization that are invariant to the unknown normalizing constant; they also admit both gradient-free and gradient-informed implementations. The resulting mean shift interacting particle systems converge quickly, capture anisotropy and multi-modality, avoid mode collapse, and scale to high dimensions. We demonstrate their performance on a wide range of benchmark sampling problems, including multi-modal mixtures, Bayesian hierarchical models, PDE-constrained inverse problems, and beyond.
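A stripped-down sketch of the MMD-minimization core behind these particle systems, with loud caveats: the target here is an empirical sample rather than an unnormalized density, the particles are unweighted, and plain gradient descent replaces the paper's mean-shift dynamics. It shows only that a small particle set can be driven toward a target by descending squared MMD under a Gaussian kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.5

def k(x, y):
    d = x[:, None, :] - y[None, :, :]
    return np.exp(-gamma * np.sum(d ** 2, axis=-1))

def mmd2(x, y):
    # Squared maximum mean discrepancy between two point clouds.
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

def grad_particles(x, y):
    # Analytic gradient of mmd2 w.r.t. the particle positions x for the
    # Gaussian kernel exp(-gamma * ||a - b||^2).
    n, m = len(x), len(y)
    dxx = x[:, None, :] - x[None, :, :]
    kxx = np.exp(-gamma * np.sum(dxx ** 2, axis=-1))
    dxy = x[:, None, :] - y[None, :, :]
    kxy = np.exp(-gamma * np.sum(dxy ** 2, axis=-1))
    g = (-4 * gamma / n ** 2) * np.sum(kxx[:, :, None] * dxx, axis=1)
    g += (4 * gamma / (n * m)) * np.sum(kxy[:, :, None] * dxy, axis=1)
    return g

target = rng.normal(0.0, 1.0, (400, 2))   # stand-in "target" sample
x = rng.uniform(-3.0, 3.0, (20, 2))       # 20 quadrature particles
m0 = mmd2(x, target)
for _ in range(500):
    x -= 0.2 * grad_particles(x, target)  # plain gradient descent on MMD^2
m1 = mmd2(x, target)
```

The interacting structure is visible in `grad_particles`: each particle feels an attraction toward the target cloud and a repulsion from the other particles, which is what prevents mode collapse.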

[LG-104] Wahkon: A Statistically Principled Deep RKHS Superposition Network

链接: https://arxiv.org/abs/2605.14041
作者: Yongkai Chen,Wenxuan Zhong,Ping Ma
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep learning excels at prediction but often lacks finite-sample guarantees and calibrated uncertainty; RKHS (Reproducing Kernel Hilbert Space)-based methods provide those guarantees but struggle to adapt in high dimensions. We propose Wahkon, a deep RKHS superposition network that unifies Kolmogorov’s superposition principle with RKHS regularization in the smoothing-spline tradition of Wahba. This yields a finite-dimensional deep representer theorem that makes training tractable and provides explicit layerwise complexity control. We show the penalized estimator is exactly the MAP (maximum a posteriori) estimate under a hierarchical Gaussian-process prior, extending the spline/GP duality to deep compositions. Using metric-entropy arguments, we establish minimax-optimal convergence rates under mild smoothness and clarify how depth and width trade off with regularity. Empirically, Wahkon outperforms multilayer perceptrons, Neural Tangent Kernels, and Kolmogorov–Arnold Networks across simulation benchmarks and a single-cell CITE-seq study. By unifying Kolmogorov’s superposition principle with RKHS regularization, Wahkon delivers accuracy, interpretability, and statistical rigor in a single framework.

[LG-105] Regret Equals Covariance: A Closed-Form Characterization for Stochastic Optimization

链接: https://arxiv.org/abs/2605.14019
作者: Irene Aldridge
类目: Econometrics (econ.EM); Machine Learning (cs.LG); Statistics Theory (math.ST); Computation (stat.CO)
*备注: 33 pages

点击查看摘要

Abstract:Regret is the cost of uncertainty in algorithmic decision-making. Quantifying regret typically requires computationally expensive simulation via Sample Average Approximation (SAA), with complexity \mathcal{O}(Bn^2d^3) in the number of scenarios B, variables n, and constraints d. This paper proves that expected regret in any stochastic optimization problem admits the exact decomposition \mathrm{Regret}(c) = \mathrm{Cov}(c, \pi^*(c)) + R(c), where c is the vector of uncertain parameters, \pi^*(c) is the optimal decision, and R(c) is a residual whose magnitude we bound explicitly under Lipschitz, smooth, and strongly convex conditions. For linear programs and unconstrained quadratic programs, including the classical Markowitz portfolio problem, we prove R(c) = 0 exactly, so that \mathrm{Regret}(c) = \mathrm{Cov}(c, \pi^*(c)) holds without approximation. When historical cost-decision pairs (c_i, \pi^*(c_i)) are available, the covariance can be estimated in \mathcal{O}(nd^2) time, which is orders of magnitude faster than SAA. The estimation is performed by a single pass through the data. We derive concentration bounds, a central limit theorem, and an asymptotically unbiased residual estimator, and we validate all results on synthetic LP, QP, and integer programming instances and on a rolling-window portfolio experiment using ten years of CRSP equity data.
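A hedged sketch of the single-pass covariance estimator mentioned above, for an unconstrained Markowitz-style QP where the optimal decision is pi*(c) = Sigma^{-1} c. Assumptions beyond the abstract: "Cov" is read as the scalar E<c, pi*(c)> - <E c, E pi*>, which for this toy problem with identity cost covariance equals tr(Sigma^{-1}); the paper's exact definitions and scaling may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
Sigma = np.diag([1.0, 2.0, 0.5, 1.5])
Sinv = np.linalg.inv(Sigma)

def pi_star(c):
    # Optimal decision of the unconstrained QP: min_pi 0.5*pi'Sigma*pi - c'pi.
    return Sinv @ c

mu = np.array([1.0, 0.5, -0.3, 0.2])
C = rng.normal(0.0, 1.0, (5000, n)) + mu   # historical cost draws
P = C @ Sinv.T                             # pi_star applied row-wise

# Single-pass covariance estimator: three running sums, one pass over data.
s_cp, s_c, s_p = 0.0, np.zeros(n), np.zeros(n)
for c, p in zip(C, P):
    s_cp += c @ p
    s_c += c
    s_p += p
m = len(C)
cov_hat = s_cp / m - (s_c / m) @ (s_p / m)
```

Because the cost draws have identity covariance here, `cov_hat` should concentrate around `np.trace(Sinv)`, and no optimization problem is re-solved during estimation, which is the source of the claimed speedup over SAA.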

[LG-106] Synthetic American Option Pricing via Jump-HMM-Driven Heston Implied Volatility

链接: https://arxiv.org/abs/2605.13998
作者: Julia Sun,Zheyu Jin,Jiawei Zhang,Jeffrey D. Varner
类目: Computational Finance (q-fin.CP); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Generating realistic synthetic option prices requires implied volatility as an input, yet implied volatility is itself derived from observed option prices, creating a circular dependency that limits synthetic data for machine-learning and risk-analysis applications. We break this circularity with a pipeline in which implied volatility emerges as an output of a structural model of equity returns. A Jump Hidden Markov Model produces multi-asset price paths with realistic stylized facts and cross-asset tail dependence; a modified Heston variance process, whose mean-reversion target depends on regime state, days to expiration, moneyness, and a market-mood indicator, converts those paths into implied-volatility paths; and a recombining binomial lattice prices American options from the resulting surface. Initializing variance at its mean-reversion target for each strike-expiration pair lets smile, skew, and term structure emerge without external calibration. We calibrate the shape function through a hierarchy spanning a parametric baseline, a globally shared neural surrogate, and a sector-specific neural surrogate fit to a multi-ticker, multi-sector option ladder. A temporal holdout on a multi-day capture isolated scheduled corporate events as the dominant source of test-time generalization error, and calendar-derived earnings-distance and same-sector peer-coupling features recovered the anticipatory portion of that signal. We then apply the framework as a synthetic-data generator on real near-the-money put and call contracts, forward-simulating price paths, and recovering path-conditional implied volatility, finite-difference American Greeks, and terminal short-premium profit and loss from one coherent simulation, and confirm cross-ticker robustness by re-running on a second underlying from a different sector and volatility regime. The framework is released as an open-source Julia package.
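The final pricing stage described above, a recombining binomial lattice with early exercise, can be sketched in its plain constant-volatility Cox-Ross-Rubinstein form. This is a simplified stand-in: the paper feeds the lattice a regime- and path-dependent Heston implied volatility, which this sketch omits.

```python
import numpy as np

def american_put_crr(S0, K, r, sigma, T, steps):
    """American put on a recombining Cox-Ross-Rubinstein binomial lattice:
    backward induction with an early-exercise check at every node."""
    dt = T / steps
    u = np.exp(sigma * np.sqrt(dt))
    d = 1.0 / u
    p = (np.exp(r * dt) - d) / (u - d)   # risk-neutral up-probability
    disc = np.exp(-r * dt)
    # Payoffs at expiry over the steps+1 terminal nodes (j = number of ups).
    j = np.arange(steps + 1)
    V = np.maximum(K - S0 * u ** j * d ** (steps - j), 0.0)
    for i in range(steps - 1, -1, -1):
        j = np.arange(i + 1)
        S = S0 * u ** j * d ** (i - j)
        cont = disc * (p * V[1:] + (1 - p) * V[:-1])  # continuation value
        V = np.maximum(K - S, cont)                   # early exercise
    return float(V[0])

price = american_put_crr(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, steps=500)
```

Because the lattice recombines, memory stays linear in `steps`, and American Greeks can be taken by finite differences over `S0` and `sigma`, as the abstract's pipeline does.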

[LG-107] Winning Lottery Tickets in Neural Networks via a Quantum-Inspired Classical Algorithm

链接: https://arxiv.org/abs/2605.13979
作者: Natsuto Isogai,Hayata Yamasaki,Sho Sonoda,Mio Murao
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 28 pages, 3 figures

点击查看摘要

Abstract:Quantum machine learning (QML) aims to accelerate machine learning tasks by exploiting quantum computation. Previous work studied a QML algorithm for selecting sparse subnetworks from large shallow neural networks. Instead of directly solving an optimization problem over a large-scale network, this algorithm constructs a sparse subnetwork by sampling hidden nodes from an optimized probability distribution defined using the ridgelet transform. The quantum algorithm performs this sampling in time O(D) in the data dimension D, whereas a naive classical implementation relies on handling exponentially many candidate nodes and hence takes \exp[O(D)] time. In this work, we construct and analyze a quantum-inspired fully classical algorithm for the same sampling task. We show that our algorithm runs in time O(\operatorname{poly}(D)), thereby removing the exponential dependence on D from the previous classical approach. Numerical simulations show that the proposed sampler achieves empirical risk comparable to exact sampling from the optimized distribution and substantially lower than sampling from the non-optimized uniform distribution, while also exhibiting exponentially improved runtime scaling compared with the conventional classical implementation. These successful dequantization results show that sparse subnetwork selection via optimized sampling can be achieved classically with polynomial data-dimension scaling on conventional computers without quantum hardware, providing an alternative to the existing quantum algorithm.

[LG-108] A Survey on Data-Dependent Worst-Case Generalization Bounds

链接: https://arxiv.org/abs/2605.13913
作者: Hubert Leroux,Jean Marcus,Julien Roger
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 15 pages, 4 figures, 3 tables. The LaTeX source uses the JMLR preprint style ( this http URL ) and BibTeX ( this http URL ). Central references in arXiv form include arXiv:2404.17442 , arXiv:2006.09313 , arXiv:2302.02766 , arXiv:2407.08723 , and arXiv:2507.06775

点击查看摘要

Abstract:Deep neural networks generalize well despite being heavily overparameterized, in apparent contradiction with classical learning theory based on uniform convergence over fixed hypothesis spaces. Uniform bounds over the entire parameter space are vacuous in this regime, and recent work has shown that non-vacuous guarantees can be recovered by restricting attention to the part of parameter space that the algorithm actually visits. This survey paper organizes this line of work around three steps: extending PAC-Bayesian theory to random, data-dependent hypothesis sets (arXiv:2404.17442); refining the complexity term with geometric and topological descriptors of the optimization trajectory, including fractal dimensions, alpha-weighted lifetime sums, and positive magnitude (arXiv:2006.09313, arXiv:2302.02766, arXiv:2407.08723); and replacing the resulting information-theoretic terms by stability assumptions (arXiv:2507.06775). We unify these contributions around a single template inequality and a head-to-head comparison of the resulting bounds.

[LG-109] Feature Visualization Recovers Known Cortical Selectivity from TRIBE v2

链接: https://arxiv.org/abs/2605.13904
作者: Stuart Bladon,Brinnae Bent
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 8 pages, 3 figures, 2 tables. Code available at this https URL

点击查看摘要

Abstract:Brain encoder models predict cortical fMRI responses from the internal activations of pretrained vision and language networks, and are typically evaluated by held-out prediction accuracy. This is a useful signal for training but a poor one for interpretation: it tells us an encoder fits the data without telling us whether it has internalized the functional organization of the brain. We propose feature visualization – gradient ascent on the encoder’s predicted activation for a target region of interest (ROI) – as a complementary interpretability technique, and apply it to TRIBE v2 composed with V-JEPA 2 (ViT-G, 40 layers), holding both frozen and synthesizing still images for seven regions spanning the ventral and dorsal visual hierarchies. Under identical hyperparameters, the probe recovers a visible progression of increasing spatial scale and feature complexity across V1 to V4, matching the ventral-stream hierarchy. It also produces three distinctive downstream regimes: radial “frozen-motion” streaks for the middle temporal area (MT) despite static-only optimization, face-like features for the fusiform face area (FFA), and consistent rectilinear line patterns for the parahippocampal place area (PPA). Optimized FFA stimuli drive the predicted region ~4x as much as a natural face photograph, consistent with feature visualization producing adversarial super-stimuli rather than canonical exemplars. The probe is simple, differentiable, and applicable to any brain encoder with a differentiable backbone, allowing for qualitative evaluation of brain encoders.
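The probe described above is just gradient ascent on a frozen encoder's predicted ROI activation. The hedged sketch below uses a hypothetical toy "encoder" with an analytic gradient in place of TRIBE v2 + V-JEPA 2 (real encoders are deep networks whose gradient comes from autodiff); the optimized input ends up driving the ROI far harder than the starting input, echoing the super-stimulus observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in encoder: predicted ROI activation peaks when the
# "image" x matches a fixed template w.
w = rng.normal(0, 1, 64)

def roi_activation(x):
    return float(w @ x - 0.5 * x @ x)

def roi_grad(x):
    # Analytic gradient of the toy activation (autodiff in practice).
    return w - x

x = np.zeros(64)            # start from a blank input
for _ in range(200):
    x += 0.1 * roi_grad(x)  # gradient ascent on the predicted activation

gain = roi_activation(x) - roi_activation(np.zeros(64))
```

For this toy objective the ascent converges to the template `w` itself; with a deep encoder the same loop instead surfaces whatever pattern the ROI head is most sensitive to, which is why the outputs read as adversarial super-stimuli rather than canonical exemplars.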

[LG-110] Attention-Based Multimodal Survival Prediction with Cross-Modal Bilinear Fusion

链接: https://arxiv.org/abs/2605.13897
作者: Hassan Keshvarikhojasteh,Josien P.W. Pluim,Mitko Veta
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose a novel multimodal deep learning framework for patient-level survival prediction, which integrates whole-slide histology features, RNA-seq expression profiles, and clinical variables. Our architecture combines an ABMIL module \cite{ilse2018attention} for slide-level representation with feedforward encoders for RNA and clinical data. These embeddings are then integrated through low-rank bilinear cross-modal fusion \cite{liu2018efficient} to model conditional interactions across modalities while controlling parameter growth. The model outputs continuous risk scores that are subsequently mapped to survival times using a nonparametric calibration procedure based on the Kaplan–Meier estimator \cite{kaplan1958nonparametric}. By decomposing multimodal reasoning into independent pairwise interactions, the proposed fusion design promotes structural interpretability and parameter efficiency compared with full tensor and hierarchical fusion strategies. Experiments on the CHIMERA challenge dataset demonstrate improved predictive performance over concatenation-based baselines and competitive generalization on hidden evaluation cohorts. These results indicate that the proposed framework is a promising approach for multimodal survival prediction in HR-NMIBC. The implementation is publicly available at this https URL.
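A minimal sketch of low-rank bilinear fusion in the spirit of the cited efficient bilinear pooling: project each modality to a shared rank-r space, fuse by elementwise product, then project out. Dimensions and initializations are invented for the example; the authors' module likely differs in normalization and activation details.

```python
import numpy as np

rng = np.random.default_rng(0)

d_hist, d_rna, rank, d_out = 512, 256, 32, 64
U = rng.normal(0, 0.05, (rank, d_hist))   # histology projection
V = rng.normal(0, 0.05, (rank, d_rna))    # RNA projection
P = rng.normal(0, 0.05, (d_out, rank))    # output projection

def bilinear_fuse(h_hist, h_rna):
    # Low-rank bilinear interaction: elementwise product in a shared
    # rank-`rank` space instead of a full d_hist x d_rna x d_out tensor.
    z = (U @ h_hist) * (V @ h_rna)
    return P @ np.tanh(z)

fused = bilinear_fuse(rng.normal(0, 1, d_hist), rng.normal(0, 1, d_rna))

# Parameter count vs. a full bilinear map, the "controlling parameter
# growth" point from the abstract.
low_rank_params = U.size + V.size + P.size
full_params = d_out * d_hist * d_rna
```

Here the low-rank module needs roughly 27k parameters against over 8M for the full bilinear tensor, which is the efficiency argument in one line.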

[LG-111] Phylogenetic Tree Inference with Tropical Axial Attention

链接: https://arxiv.org/abs/2605.13894
作者: Chris Teska,Kurt Pasque,Ruriko Yoshida,Baran Hashemi
类目: Populations and Evolution (q-bio.PE); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:In this work, we introduce a Tropical Axial Attention neural reasoning architecture that replaces vanilla softmax dot-product attention with max-plus operators, inducing a piecewise-linear structure aligned with dynamic programming formulations. From multi-species sequence alignments, our model learns all possible pairwise distances and is trained using a combination of \ell_1 and tropical symmetric distance metric losses with an ultrametric violation penalty. We leverage the well known isomorphic relationship between the space of all phylogenetic trees with n species and the tropical Grassmannian to show that tropical attention provides a natural geometric framework for phylogenetic inference. On empirical DS1-DS11 alignments, where true trees are unknown, the tropical model produces distance matrices that are substantially closer to their BME-induced tree metrics than the baseline models. These results suggest that tropical attention is a useful geometric inductive bias for neural phylogenetic inference, especially under distribution shift and when tree-metric consistency is important.
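To make the max-plus substitution concrete, here is a hedged sketch of a tropical (max-plus) matrix product and a hard, winner-take-all attention built on it. This is a simplified stand-in for the paper's tropical axial attention (which operates axially over alignment tensors), intended only to show the semiring swap that yields piecewise-linear, dynamic-programming-like behavior.

```python
import numpy as np

def maxplus_matmul(A, B):
    """Tropical matrix product: (A (x) B)[i, j] = max_k A[i, k] + B[k, j],
    the max-plus semiring underlying shortest-path-style dynamic programs."""
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

def tropical_attention(Q, K, V):
    # Max-plus analogue of attention: similarity scores combine additively,
    # and each query attends to its single best key (hard selection),
    # replacing the softmax-weighted average of dot-product attention.
    scores = maxplus_matmul(Q, K.T)   # (n_q, n_k) max-plus similarities
    idx = np.argmax(scores, axis=1)   # winner-take-all key per query
    return V[idx]
```

Because max and + are both piecewise linear, stacking such layers keeps the whole map piecewise linear, which is the structural alignment with dynamic programming that the abstract highlights.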

附件下载

点击下载今日全部论文列表