本篇博文主要内容为 2026-06-02 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-06-02)
今日共更新1622篇论文,其中:
- 自然语言处理共268篇(Computation and Language (cs.CL))
- 人工智能共577篇(Artificial Intelligence (cs.AI))
- 计算机视觉共364篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共523篇(Machine Learning (cs.LG))
- 多智能体系统共36篇(Multiagent Systems (cs.MA))
- 信息检索共32篇(Information Retrieval (cs.IR))
- 人机交互共49篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents
【速读】:该论文旨在解决现有医疗大模型评估基准在模拟真实临床决策过程中的局限性问题。传统静态基准无法捕捉医生在不确定性环境下逐步、不可逆地整合异构信息并做出连续决策的动态特性,而现有交互式基准则在至少一个关键维度上存在妥协。为此,作者提出ClinEnv,一种基于“纵向住院模拟”(Longitudinal Inpatient Simulation)范式的交互式评估框架,将真实的住院病例自动构建为有序的决策阶段序列。在每个阶段,模型需主动向四个专业化代理查询信息后,才能决定用药、操作及诊断。该评估体系不仅通过确定性本体对齐方式衡量决策结果质量,还量化信息获取过程的质量。实验表明,在七种模型中最强者仅达到0.31的决策F1分数,且结果质量与决策过程质量显著脱钩;模型在管理类决策和后期阶段表现尤为薄弱,虽能较可靠地恢复出院诊断(F1=0.51),但对管理措施的决策准确率极低(F1=0.17),且随病例推进仍持续发出冗余查询。ClinEnv成功揭示了传统仅依赖结果评价所忽视的信息获取能力缺口,并使其可直接测量。
链接: https://arxiv.org/abs/2606.02568
作者: Yuxing Lu,Yushuhong Lin,Wenqi Shi,J. Ben Tamo,Xukai Zhao,Jinzhuo Wang,May Dongmei Wang
机构: Georgia Institute of Technology(佐治亚理工学院); Peking University(北京大学); University of Texas Southwestern Medical Center(德克萨斯西南医学中心); Tsinghua University(清华大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Multiagent Systems (cs.MA)
备注: 20 pages, 6 figures, 12 tables
Abstract:Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.
[MA-1] ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning ACL2026
【速读】:该论文旨在解决大语言模型(LLM)在表格问答任务中难以进行面向未来的数值预测这一关键问题。现有系统普遍缺乏对时间序列数据的建模能力,无法支持对未来趋势的推理与预测,导致其在真实场景中的应用受限。为应对这一挑战,研究提出了一项新任务——开放域表格问答中的未来数据预测与推理,并构建了首个基于房地产真实数据的时间序列预测与基于预测的推理数据集。该任务的核心难点在于精准检索历史数据、突破大语言模型固有的预测能力局限性,以及统一多样查询下的回答格式。为此,作者提出TimeFore框架,其关键创新在于采用基于大语言模型代理(LLM agent)的协同架构,将问题分解为三个角色:检索器(Retriever)自主生成SQL语句以获取相关历史数据;预测器(Forecaster)调用外部时间序列模型以提升预测精度;分析器(Analyzer)则整合多源信息,生成准确且一致的最终答案。实验结果表明,TimeFore在复杂预测与推理任务中显著优于现有方法。
链接: https://arxiv.org/abs/2606.02433
作者: Zhensheng Wang,Xiaole Liu,Wenmian Yang,Kun Zhou,Yiquan Zhang,Weijia Jia
机构: Beijing Normal University(北京师范大学); Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: This paper has been accepted by Findings of ACL 2026
Abstract:The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.
[MA-2] A Game-Theoretic Decision Framework for Optimal Selection of Coordination Detection Methods in Multi-UAV Fleet Operations
【速读】:该论文旨在解决无人机群(UAV fleet)在共享空域中协同行为检测与航路主导机识别时面临的速度-精度权衡难题:快速方法虽可支持实时交通管理,但牺牲了检测精度;而高精度方法往往超出可行动空域冲突解脱的时间预算。其解决方案的关键在于提出一种博弈论决策框架,将方法选择建模为监测者(Monitor)与自然(Nature)之间的双人零和博弈,其中监测者优化计算方法及参数组合以应对未知的交通场景。通过构建从轨迹监控数据到八种候选检测算法、蒙特卡洛敏感性分析以及多目标优化层的端到端流程,该框架利用极小极大解生成一个覆盖多种方法的概率混合策略,确保在任意场景不确定性下仍能实现最差情况下的性能保障。实验结果表明,该框架可根据不同操作优先级推荐差异化的方法组合——在平衡型与速度优先型场景中,Koopman相位法表现最优;而在航路主导机识别优先时,复杂递归量化分析(CRQA)成为主要选择,并在所有测试偏好配置下实现了0.29–0.53(归一化效用)的保证博弈值,首次为无人机交通管理中的计算方法选择提供了可解释、场景自适应的理论方法。
链接: https://arxiv.org/abs/2606.02383
作者: Christian Manasseh
机构: 未知
类目: Multiagent Systems (cs.MA); Computer Science and Game Theory (cs.GT); Systems and Control (eess.SY)
备注:
Abstract:Detecting coordination among unmanned aerial vehicle (UAV) fleets operating in shared airspace and identifying the route-lead aircraft whose navigation decisions govern fleet behavior presents a fundamental speed–accuracy trade-off: fast methods enable real-time traffic management but sacrifice detection fidelity, while accurate methods may exceed the time budget for actionable airspace deconfliction. This paper presents a game-theoretic decision framework that resolves this trade-off by formulating method selection as a two-player zero-sum game between a Monitor (selecting computational methods and parameters) and Nature (selecting the unknown traffic scenario). We construct an end-to-end pipeline from trajectory surveillance data through eight candidate detection algorithms, a Monte Carlo sensitivity analysis characterizing their stochastic performance, and finally a multi-objective optimization layer that identifies Pareto-optimal method portfolios. The minimax solution provides a robust mixed strategy with a probability distribution over methods that guarantees worst-case performance regardless of scenario uncertainty. Experimental evaluation across 200 randomized configurations spanning 5–50 aircraft demonstrates that the framework recommends distinct method portfolios depending on operational priority: Koopman Phase dominates balanced (70.6%) and speed-priority (79.7%) profiles, while CRQA emerges as primary (47.4%) when route-lead identification is prioritized. The framework achieves a guaranteed game value of 0.29–0.53 (normalized utility) across all tested preference profiles, providing the first principled, scenario-adaptive methodology for computational method selection in UTM fleet monitoring operations.
[MA-3] Agent ic-J: An AI Agent for Biological Microscopy Image Analysis
【速读】:该论文旨在解决生物图像分析中跨异构工具、编程环境及领域知识集成困难的问题,这一挑战使得多数研究人员难以独立完成复杂分析任务。其核心解决方案是提出Agentic-J——一个容器化部署的多智能体人工智能助手,专为ImageJ/Fiji平台设计,支持生物学家通过自然语言描述从核分割、细胞追踪到多条件定量分析等任务。系统通过专业化子智能体协同工作,分别负责插件管理、代码生成、调试、质量保障与统计报告,自动生成结构化、可追溯且可复现的执行脚本与文档化项目框架。该方案的关键在于实现“自然语言驱动的自动化工作流构建”,将领域知识与技术实现解耦,显著降低生物图像分析的技术门槛并提升研究可重复性。
链接: https://arxiv.org/abs/2606.02080
作者: Lukas Johanns,Marilin Moor,Davide Panzeri,Yu Zhou,Xinyi Chen,Nora F. K. Pauly,Zixuan Pan,Matthias Gunzer,Andreas Müller,Yiyu Shi,Hedi Peterson,Jianxu Chen
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at Cell Biology at Scale 2026 (Poster). The Agentic-J project is available at this https URL
Abstract:Biological image analysis increasingly demands integration across heterogeneous tools, programming environments, and domain knowledge that few researchers can command simultaneously. We present Agentic-J, a containerised, multi-agent AI assistant, primarily for ImageJ/Fiji that enables biologists to specify analysis tasks in natural language, from nuclei segmentation and cell tracking to multi-condition quantification. The agent generates executable scripts organised into a documented project structure, so every analysis decision is traceable and the workflow can be reproduced or shared. The specialised sub-agents handle plugin management, code generation, debugging, quality assurance, and statistical reporting. In this paper we introduce the system’s design, demonstrate real biological microscopy image analysis workflows, and detailed the technical implementation.
[MA-4] World-Task Factorization for Robot Learning
【速读】:该论文旨在解决机器人学习中政策(policy)在面对新约束、新队友及新环境组合时的泛化能力不足问题。其核心挑战在于如何通过合理的结构化因子分解(structural factorization),明确界定哪些部分应具备通用性、哪些需依赖特定训练、哪些存在纠缠关系。论文提出最根本的因子分解方式是将“世界”与“任务”相分离:世界因子(world factors)描述具身系统与环境的固有属性,独立于意图存在;任务因子(task factors)则由任务逻辑决定,反映世界所允许的行为空间。这一不对称性通过贝叶斯模型证据(Bayesian model evidence)形式化,确保与数据生成过程一致,维持高似然性,并降低奥卡姆剃刀对任务参数的惩罚。关键解决方案为构建一个可微分的递归估计器图(AICON),该图具有组合性、无需任务特定数据即可运行,并能将代价梯度传播至执行器;同时配合一个紧凑的、可学习的策略网络,用于调制梯度路径。梯度在此作为双因子间的接口:承载世界结构通过图结构传递,同时携带任务结构通过代价函数体现,从而实现低维学习的同时保持结构化泛化能力。实验在异构机器人、环境、任务逻辑和感知-运动模态的三类任务上验证了该框架,结果表明其优于端到端基线与解析启发式方法,在分布外配置下实现零样本泛化,并可在不重新训练的情况下迁移至真实硬件平台。
链接: https://arxiv.org/abs/2606.02027
作者: Eduardo Sebastián,Adrian Pfisterer,Vito Mengers,Oliver Brock,Amanda Prorok
机构: University of Cambridge (剑桥大学); Technische Universität Berlin (柏林工业大学); Science of Intelligence (SCIoI), Cluster of Excellence (智能科学卓越中心, 优秀集群); Robotics Institute Germany (德国机器人研究所)
类目: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task’s logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam razor’s penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.
[MA-5] A Simple Hierarchical Causality Primer
【速读】:该论文旨在解决复杂系统中层级因果关系的形式化问题,核心挑战在于如何准确描述不同层级上“行为者”(actor)的角色如何约束、选择并组织“代理”(agent)在各层级上的局部动态行为。传统方法往往忽视了层级间因果影响的抽象与传递机制,导致对系统整体行为的理解存在断裂。其解决方案的关键在于引入三个基本结构:一是因果类(causation classes),用于抽象行为者所体现的特定因果影响力形式;二是聚合算子(aggregation operators),实现跨层级的状态或行为信息传递;三是离散事件时间映射(discrete event-time maps),以明确局部事件计数与全局时钟之间的关系。该框架采用简洁且离散的形式,为复杂系统中多层级因果关系的建模提供了可操作的理论基础。
链接: https://arxiv.org/abs/2606.01979
作者: Tim Gebbie
机构: University of Cape Town (开普敦大学)
类目: Multiagent Systems (cs.MA); Optimization and Control (math.OC); Computational Finance (q-fin.CP)
备注: 8 pages, 1 figure; short technical primer with a toy example in an appendix
Abstract:We provide a brief primer for the idea behind formalising hierarchical causality in the context of complex systems. Here actors are not simply agents. Actors instantiate causation classes. Agents implement local dynamics in given levels or organisation in a given system. Hierarchical causality then describes how actor-level roles constrain, select, and organise agent-level behaviour across levels. The system then necessarily requires three additional structures. First, causation classes to abstract a given form of causal influence that an actor instantiates. Second, aggregation operators to move across the levels. Third, discrete event-time maps are required because the system comprises events, and the relation between local event counts and any global clock must be specified. Our formulation here is purposefully simple and discrete.
[MA-6] Market-Based Replanning for Safety-Critical UAV Swarms in Search and Rescue Missions
【速读】:该论文旨在解决搜救(Search and Rescue, SAR)任务中无人机蜂群在面临个体故障或性能退化时,如何实现容错协同以保障任务连续性的关键问题。其核心挑战在于资源受限环境下,如何确保蜂群具备自主重构与动态重分配能力,从而维持整体任务成功率。解决方案的关键在于提出一种分布式协调架构——智能重规划无人机蜂群(Intelligent Replanning Drone Swarm, IRDS),该架构融合反向拍卖(Reverse-Auction)市场机制与几何一致性协议:前者通过距离加权成本函数使剩余健康无人机自主竞标搜索区域任务,实现高效的任务再分配;后者则基于几何共识机制完成对目标位置的分布式验证,提升定位可靠性。仿真结果表明,在8架无人机、8×8网格环境下的随机故障注入测试中,系统可在极短延迟内完成故障代理的任务转移,即便在25%人员损失条件下仍保持93%的任务成功率,验证了该框架在自愈性空中机器人协同中的鲁棒性与有效性。
链接: https://arxiv.org/abs/2606.01970
作者: Luiz Giacomossi,Andrea Haglund,Claire Namatovu,Emily Zainali,Esaias Målqvist,Yonatan M. Beyene,Ivan Tomasic,Baran Çürüklü,Håkan Forsberg
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: 6 pages, 4 figures, accepted at MIPRO 2026
Abstract:Reliable autonomous UAV swarms in Search and Rescue (SAR) missions require fault-tolerant coordination capable of sustaining operations despite agent degradation. This paper introduces the Intelligent Replanning Drone Swarm (IRDS), a distributed coordination architecture designed for resource-constrained environments. The proposed framework employs a Reverse-Auction market mechanism where agents bid to service search sectors based on a distance-weighted cost function, coupled with a geometric consensus protocol for target verification. We evaluate the approach through physics-based simulations (N=8 agents, 8x8 grid) subjected to stochastic fault injection. Results indicate that the swarm autonomously reallocates tasks from failed agents with low latency relative to the total mission duration, maintaining a mission success rate of 93% under 25% workforce degradation. The proposed framework demonstrates a robust, empirically tested method for self-healing aerial robotic coordination.
[MA-7] QoEReason er: An Agent ic Reasoning Framework for Automated and Explainable QoE Diagnosis in RANs
【速读】:该论文旨在解决在实际运行的无线接入网(Radio Access Network, RAN)中,用户体验质量(Quality-of-Experience, QoE)退化问题的诊断难题。传统方法依赖人工专家对高维、跨层的遥测数据进行复杂分析,效率低且难以规模化。尽管大语言模型(Large Language Models, LLMs)具备强大的推理能力,但其在原始网络时间序列分析、因果链推断中的幻觉现象以及缺乏状态保持机制等问题,使其难以胜任RAN故障定位任务。为此,论文提出QoEReasoner——一个端到端的、基于LLM驱动的智能体系统,用于实现自动化且可解释的QoE诊断。其核心解决方案在于通过物理网络真实性的约束来驯服LLM的不确定性:采用确定性工具将原始关键性能指标(KPIs)转化为结构化证据,借助领域专用知识库确保故障传播符合协议规范,并利用历史专家验证案例库引导假设生成;同时,由一个状态化的中央规划器协调异常检测、因果追溯与根因定位的闭环流程。实验结果表明,QoEReasoner在多个诊断任务上的准确率优于强基线18%–40%,并将诊断时间从约30分钟的人工分析缩短至每会话仅3分钟,生成可解释的专家级报告,且对不同LLM底座具有良好的鲁棒性。
链接: https://arxiv.org/abs/2606.01925
作者: Qizhe Li,Haolong Chen,Shan Dai,Zhuo Li,Zhiwei Hu,Xuan Li,Guangxu Zhu,Qingjiang Shi
机构: The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)); Shenzhen Research Institute of Big Data(深圳市大数据研究院); Huawei Technologies Co., Ltd.(华为技术有限公司); Tongji University(同济大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Diagnosing Quality-of-Experience (QoE) degradations in operational Radio Access Networks (RANs) is a critical but notoriously complex task, traditionally requiring labor-intensive expert analysis over high-dimensional, cross-layer telemetry. While Large Language Models (LLMs) offer unprecedented reasoning capabilities, they are fundamentally unsuited for raw RANs troubleshooting: they fail at numeric time-series analysis, hallucinate protocol-violating causal links, and lack the stateful rigor required for multi-step fault localization. To bridge this gap, we present QoEReasoner, an end-to-end, LLM-driven agentic system designed for automated and explainable QoE diagnosis. QoEReasoner tames the inherent unpredictability of LLMs by grounding their reasoning in the physical realities of the network. It employs deterministic tools to reliably translate raw numeric KPIs into structured evidence, enforces protocol-consistent fault propagation through a domain-specific Knowledge Base, and leverages a Historical Bank of expert-validated cases to guide hypothesis generation. A stateful central planner orchestrates this closed-loop process across anomaly detection, causal tracing, and root-cause localization. Evaluations on real-world operational RANs datasets demonstrate that QoEReasoner outperforms strong baselines by 18%-40% in accuracy across multiple diagnostic tasks. Furthermore, it reduces diagnostic time from approximately 30 minutes of manual expert analysis to just 3 minutes per session, delivering highly interpretable, expert-grade reports while remaining robust across diverse LLM backbones.
[MA-8] RadioMaster: Multi-Agent System for Autonomous Radio Signal Generation
【速读】:该论文旨在解决无线原型设计中将用户意图转化为物理无线电信号这一关键但复杂繁琐的难题,其核心挑战在于传统方法依赖深厚的物理层知识且面临巨大的实现障碍。现有大语言模型(Large Language Models, LLMs)与多智能体系统虽在软件工程领域取得突破,但在无线电信号生成任务中因严重缺乏领域知识(domain ignorance)及对物理硬件约束的敏感性不足而表现不佳。为此,论文提出RadioMaster——一个全自主的多智能体框架,其解决方案的关键在于三个协同运作的核心组件:基于领域知识检索的RadioWiki、负责协同生成基带I/Q样本并配置硬件的RadioAgent,以及实现闭环物理层验证的RadioEmulator。通过构建首个专用于无线电信号生成领域的基准测试集RadioBench,实证表明RadioMaster在配置可行性与信号保真度方面显著优于当前最先进的基线方法。
链接: https://arxiv.org/abs/2606.01862
作者: Jiazhen Lei,Tianze Cao,Yuxin Sha,Sihan Wang,Bingbing Wang,Fengyuan Zhu,Zeming Yang,Xiaohua Tian
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Translating user intents into physical radio signals represents the critical yet notoriously tedious final step in wireless prototyping, as it requires intricate knowledge of physical layer details and presents immense implementation challenges. Large Language Models (LLMs) and multi-agent systems have revolutionized conventional software engineering, raising the compelling question of whether they can resolve these formidable difficulties. However, our investigations reveal that current models experience significant limitations and fail to accomplish this task when applied to radio signal generation. This performance degradation primarily stems from severe domain ignorance and a fundamental insensitivity to physical hardware constraints. To bridge this gap, we introduce RadioMaster, a fully autonomous multi-agent framework designed to seamlessly translate user input into real-world wireless emissions. RadioMaster operates on three synergistic pillars: RadioWiki for domain-specific knowledge retrieval, RadioAgent for collaborative I/Q sample generation alongside hardware configuration, and RadioEmulator for closed-loop physical layer verification. Furthermore, we construct RadioBench, the first comprehensive benchmark tailored specifically for the radio signal generation domain. Extensive real-world evaluations demonstrate that RadioMaster significantly outperforms state-of-the-art (SOTA) baselines regarding configuration viability and signal fidelity.
[MA-9] From Global Policies to Local Strategies: Multi-Objective Optimization of Resource-Specific Handover Policies
【速读】:该论文旨在解决业务流程管理中资源分配效率低下问题,尤其关注传统强化学习(Reinforcement Learning, RL)方法在处理多资源协作模式时忽视任务交接过程中人际协同关系的缺陷。其核心挑战在于如何在多目标优化框架下实现对个体资源(如人员)层面的精细化决策支持,以提升整体流程的成本效益、吞吐时间与资源利用率。解决方案的关键在于首次提出一种基于多智能体系统(Multi-Agent System, MAS)过程仿真器与多目标进化算法(Multi-Objective Evolutionary Algorithm, MOEA)相结合的方法,能够生成帕累托最优(Pareto-optimal)且面向具体资源的分配策略,从而显式建模并优化跨资源间的协作行为。实验结果表明,该方法在合成数据集和真实世界数据集上平均降低37%的成本与58%的等待时间,显著优于启发式基准,验证了融合协作感知优化对提升流程性能的有效性。
链接: https://arxiv.org/abs/2606.01857
作者: Lukas Kirchdorfer,Artemis Doumeni,Han van der Aa,Hugo A. López
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Efficient resource allocation is a key challenge in business process management, with direct implications for cost, throughput time, and utilization. While recent Reinforcement Learning (RL) approaches have shown promise in deriving adaptive allocation policies, they typically neglect inter-resource collaboration patterns that can strongly influence real-world task handovers. Recognizing this, this paper introduces the first approach for multi-objective optimization of resource-level decision-making, enabling the recommendation of person-specific handover policies. To achieve this, our work combines an existing Multi-Agent System-based process simulator with a multi-objective evolutionary algorithm. The resulting approach produces Pareto-optimal, resource-specific policies that optimize the process across multiple objectives. Experimental results on synthetic and real-world datasets show that our approach reduces costs by an average of 37% and waiting time by 58%, consistently outperforming heuristic baselines and demonstrating the potential of leveraging collaboration-aware optimization to improve process performance.
[MA-10] Dynamic Trust-Aware Sparse Communication Topology for LLM -Based Multi-Agent Consensus
【速读】:该论文旨在解决大规模语言模型驱动的多智能体系统在复杂推理任务中因全连接通信机制导致的通信开销过高的问题。现有框架普遍采用全连接通信模式,使得消息数量、令牌消耗及端到端延迟随智能体数量呈近似二次增长,而固定稀疏拓扑虽可降低开销,却无法根据任务实例或中间推理状态动态调整通信关系,易造成低价值交互保留或关键纠错信息丢失。为此,论文提出动态稀疏共识机制DySCo(Dynamic Sparse Consensus),其核心在于:在每轮推理中,基于智能体可靠性、答案分歧度与任务相关性动态评估通信边的价值,并在预算约束下选择高价值边进行信息交换;同时,通过动态信任权重聚合各智能体的输出,并在共识稳定时提前终止讨论。该机制将全局广播替换为按需通信,在显著降低通信开销的同时,有效保留了必要的交叉验证信息,实现了高效且可靠的多智能体协作。
链接: https://arxiv.org/abs/2606.01828
作者: Wanshuang Gou,Zihan Liu
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, 5 tables
Abstract:Large language model-driven multi-agent systems enhance the reliability of complex reasoning tasks through multi-round deliberation, role specialization, and cross-validation. However, existing multi-agent debate and collaboration frameworks typically adopt fully connected communication, causing the number of messages, token costs, and end-to-end latency to grow approximately quadratically with the number of agents; although fixed sparse topologies reduce overhead, they cannot adapt communication relationships to different task instances or intermediate reasoning states, making them prone either to preserving low-value interactions or to losing critical error-correction information. To address this problem, this paper proposes DySCo (Dynamic Sparse Consensus), a dynamic trust-aware sparse consensus mechanism. In each round of reasoning, DySCo estimates the value of communication edges based on agent reliability, answer divergence, and task relevance, and selects a small number of high-value edges for message exchange under budget constraints; it then aggregates the answers of different agents through dynamic trust weights and terminates the discussion early once consensus stabilizes. This mechanism replaces universal broadcasting with on-demand communication, thereby reducing communication overhead while preserving essential cross-validation information. We further present analyses of communication complexity and consensus stability, and evaluate the performance of DySCo on mathematical reasoning, logical reasoning, and factual question-answering tasks.
[MA-11] MetaForge: A Self-Evolving Multimodal Agent that Retrieves Adapts and Forges Tools On Demand
【速读】:该论文旨在解决多模态智能体在复杂推理任务中因工具库静态预设而难以泛化至未见场景,以及盲目调用工具导致冗余开销与噪声干扰错误的核心问题。其解决方案的关键在于提出MetaForge框架,通过将智能体行为解耦为四个协同阶段——决策(判断是否需要调用工具)、检索(选择合适工具)、适应(根据任务上下文调整工具参数)和锻造(在线合成新技能并回填至工具库以供复用),构建了一个闭环的“判-检-适-锻-复用”循环机制。该框架引入统一的编排策略,使智能体可自主决定直接回答、复用已有工具或生成新技能;并通过强化学习联合优化工具调用必要性、检索准确性、执行有效性及新技能可复用性,同时施加显式的调用成本惩罚以抑制冗余调用。实验在12个基准测试上验证了MetaForge在准确性、效率与泛化能力方面均显著优于16个基线模型,标志着从静态工具库向按需自进化工具体系的范式转变。
链接: https://arxiv.org/abs/2606.01801
作者: Shouang Wei,Houcheng Min,Xinpeng Dong,Xin Lin,Sen Cui,Bo Jiang,Zhongxiang Dai,Kun Kuang,Guandong Xu,Fei Wu,Min Zhang
机构: 未知
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Multimodal agents have achieved notable progress on complex reasoning tasks through tool use, yet remain limited by two issues: statically predefined tool inventories fail to generalize to unseen scenarios, and indiscriminate tool invocation incurs redundant cost and noise-induced errors. We propose MetaForge, a multimodal agent framework that learns when to invoke tools and how to evolve its toolset on demand. MetaForge factorizes agentic behavior into four coupled stages: Decide (judging whether tool use is warranted), Retrieve (selecting suitable tools), Adapt (grounding tool parameters in task context), and Forge (synthesizing new skills online and recycling them into the tool library for reuse), forming a closed judge-retrieve-adapt-forge-recycle loop. A unified orchestration policy enables the agent to choose among answering directly, reusing existing tools, or forging new ones. We jointly optimize invocation necessity, retrieval accuracy, execution effectiveness, and forged-skill reusability via reinforcement learning, with an explicit invocation-cost penalty discouraging redundant calls. Across 12 benchmarks, MetaForge consistently surpasses 16 baselines in accuracy, efficiency, and generalization, validating a paradigm shift from static tool inventories to on-demand self-evolution.
[MA-12] A Sheaf Framework for Strategic Multi-Agent Systems: From Consensus to Nash Equilibria
【速读】:该论文旨在解决异构自主代理在动态对抗环境中协同决策时,如何同时满足几何约束、逻辑一致性、时间推理与策略优化的复杂问题。现有层化(sheaf)与拓扑(topos)理论框架虽能有效处理几何共识、知识对齐与因果规划,但缺乏对价值、奖励机制及战略选择的显式建模。其解决方案的关键在于构建一个统一的范畴论框架,将事件演算(event calculus)、类SCEL的群体形成机制与博弈论奖励结构整合至一个时空历史的格罗滕迪克拓扑(Grothendieck topos)中。核心创新在于引入“博弈层化”(game sheaf)概念,其截面(stalks)包含效用函数与策略分布,限制映射(restriction maps)同时编码平行传输与最优响应动力学。研究证明,纳什均衡对应于导出的最优响应对应层化的全局截面,而上同调障碍则用于分类战略一致性的失效情形。通过免疫学“堡垒防御”场景的详细案例分析,验证了该框架在资源受限下异构代理形成攻防群体时的表达能力与有效性,为可验证、自管理且经济理性的多智能体系统提供了严格的数学基础。
链接: https://arxiv.org/abs/2606.01663
作者: Manuel Hernández,Eduardo Sánchez-Soto
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
备注:
Abstract:The coordination of heterogeneous autonomous agents in dynamic, adversarial environments requires simultaneous satisfaction of geometric constraints, logical consistency, temporal reasoning, and strategic optimization. Existing sheaf- and topos-theoretic frameworks provide powerful tools for geometric consensus, knowledge alignment, and causal planning, but lack explicit models for value, reward, and strategic choice. This report presents a unified categorical framework that integrates event calculus, SCEL-like ensemble formation, and game-theoretic reward structures into a single Grothendieck topos of time-space histories. We introduce the notion of a \emphgame sheaf whose stalks contain utility functions and policy distributions, and restriction maps encode both parallel transport and best-response dynamics. We prove that Nash equilibria correspond to global sections of a derived best-response correspondence sheaf, while cohomological obstructions classify failures of strategic consistency. A detailed case study of an immunological ``bastion defense’’ scenario – heterogeneous agents forming attack/defense ensembles under resource constraints – demonstrates the framework’s expressiveness. This synthesis provides a rigorous foundation for verifiable, autonomic, and economically rational multi-agent systems.
[MA-13] chGraphRAG : An Agent ic Graph-Augmented RAG Framework for Technical Literature Reasoning
【速读】:该论文旨在解决在特定技术领域(智能轮胎、车辆动力学与车辆控制)中,面对大规模、高复杂性学术文献时,传统单次检索增强生成(RAG)系统难以实现精准、可验证且具备推理能力的技术问答与知识推理问题。其核心挑战在于如何确保生成内容的证据充分性、引用准确性及上下文连贯性,同时克服信息孤岛与知识碎片化。解决方案的关键在于提出一个13步自主代理式检索增强生成(agentic RAG)框架,通过多维度证据充分性评分(100分制,涵盖相关性、完整性、时效性等五个维度)、基于图谱的关联上下文挖掘(利用Neo4j构建的语义知识图谱)、外部数据库的迭代优化-搜索-验证循环(集成Crossref、OpenAlex、Semantic Scholar),以及具备漂移防护机制的查询重构重试策略,实现了从查询理解到结果生成的全流程自动化与自我修正。尤其关键的是引入了“路由依赖型”外部搜索架构与基于大语言模型(LLM)与规则混合的双层审查机制,结合自校正生成环路与引用完整性验证,显著提升了生成内容的可信度与技术推理能力。
链接: https://arxiv.org/abs/2606.01613
作者: Kanwar Bharat Singh
机构: The Goodyear Tire and Rubber Company (固特异轮胎橡胶公司)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:This paper presents an agentic retrieval-augmented generation (RAG) framework for domain-specific technical reasoning support, instantiated over a curated corpus of approximately 2,100 academic papers in intelligent tires, vehicle dynamics, and vehicle control. Unlike conventional single-pass RAG systems, the proposed architecture employs a 13-step autonomous pipeline that classifies queries by intent, scores evidence sufficiency against a multi-dimensional rubric, performs agentic retry with drift-guarded query reformulation, searches external academic databases (Crossref, OpenAlex, Semantic Scholar) through iterative optimize–search–vet loops, traverses a Neo4j knowledge graph for relational context, verifies citation integrity, and applies post-generation quality checks with automatic regeneration. Key contributions include a 100-point evidence sufficiency scoring framework across five dimensions with relevance damping and hybrid rule-based/LLM review; a route-dependent external search architecture with iterative agentic loops; a knowledge graph constructed via LLM-based entity extraction and OpenAlex author validation with intra-corpus citation resolution; and a self-correcting generation loop with citation verification and quality assessment. The framework is presented as a practical, implemented case study illustrating how agentic, evidence-grounded RAG can support literature navigation and technical reasoning over large, domain-specific corpora.
[MA-14] Physics-Informed Modeling and Control of Emergent Behaviors in Robot Swarms
【速读】:该论文旨在解决多阶段群体涌现行为在机器人集群中建模与控制的难题,尤其针对行为随时间演化呈现多个阶段时,如何实现可解释、可调控的集体智能。其核心挑战在于如何在保持局部感知与去中心化决策的前提下,对跨阶段的群体动态进行统一建模与精准控制。解决方案的关键在于提出一种物理信息驱动的微-宏框架PhySwarm,通过宏观层面的多相输运-扩散-反应模型(Macro-ADR)描述基于相变的群密度场演化,微观层面则采用等效确定性运动模型(Micro-EDM)以势场输运、密度梯度补偿及速率或事件触发的相位切换机制实现可执行的机器人运动。进一步引入神经物理控制器(NPC),结合强化学习与物理信息神经网络(PINN)目标函数,将局部观测与时序记忆映射为有界物理参数,并同时优化任务奖励、宏观密度残差与微观运动一致性约束。实验验证了该框架在路径引导觅食、构型可重构导航和角色自适应搜救等任务中生成多样化多阶段涌现行为的能力,揭示了输运、扩散与反应协同调控群体组织的可解释机制,为机器人集群的涌现行为学习、解析与控制提供了物理可解释的统一范式。
链接: https://arxiv.org/abs/2606.01597
作者: Zixuan Jin,Wenzhuo Zhang,Shuxian Quan,Zirui Dong,Fangwen Ye,Yuchen Shi,Cheng Xu
机构: 未知
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注:
Abstract:Robot swarms can exhibit coherent collective behaviors through local perception, limited communication and decentralized decision-making, yet modeling and controlling such emergence remains challenging when behaviors unfold over multiple phases. Here we introduce PhySwarm, a physics-informed micro–macro framework that represents multi-stage swarm emergence as physically constrained density-field evolution coupled to executable robot motion. At the macroscopic level, a multi-phase advection–diffusion–reaction model (Macro-ADR) describes phase-dependent swarm-density evolution through directed transport, diffusion-based spatial regulation and behavioral phase transitions. At the microscopic level, an equivalent deterministic motion model (Micro-EDM) realizes these mechanisms through potential-field advection, density-gradient compensation and rate- or event-gated phase switching. A neural-physics controller (NPC) maps local observations and temporal memory to bounded physical parameters, and is trained with a reinforcement learning–PINN objective that combines task rewards with macro-scale density residuals and micro-scale motion-consistency constraints. In several proof-of-concept swarm missions – including trail-guided foraging, formation-reconfigurable navigation and role-adaptive search and rescue – we demonstrate that PhySwarm can generate distinct multi-stage emergent behaviors within a unified physics-informed modeling framework. The learned density fields and physical parameters provide interpretable evidence of how advection, diffusion and reaction jointly regulate multi-stage swarm organization. These results establish a physics-informed route for learning, interpreting and controlling emergent behaviors in robot swarms.
[MA-15] Agent System Operations: Categorization Challenges and Future Directions
【速读】:该论文旨在解决大语言模型(Large Language Models, LLM)驱动的智能体系统在实际运行中频繁遭遇异常问题,导致系统不稳定与不安全,而现有针对智能体系统运维的研究极为匮乏,缺乏系统性方法论支撑的困境。其解决方案的关键在于提出一个全新的、全面的智能体系统运维框架——Agent System Operations (AgentOps),该框架包含四个核心阶段:监控(monitoring)、异常检测(anomaly detection)、根因定位(root cause localization)和故障恢复(resolution),并首次系统地将智能体系统中的异常划分为内部智能体异常(intra-agent anomalies)与跨智能体异常(inter-agent anomalies),为构建稳定、可解释且可维护的智能体系统提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2606.01581
作者: Zexin Wang,Changhua Pei,Yuanhao Liu,Jingjing Li,Yintong Huo,Quan Zhou,Haotian Si,Hang Cui,Zihan Liu,Gaogang Xie,Fei Sun,Dan Pei,David Lo
机构: 中国科学院自动化研究所(Chinese Academy of Sciences Institute of Automation); 中国科学院计算技术研究所(Institute of Computing Technology, Chinese Academy of Sciences); 国家自然科学基金委员会(National Natural Science Foundation of China); 香港研究资助局(RGC)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.
[MA-16] Multi-Agent Computer Use
【速读】:该论文旨在解决当前计算机使用代理(Computer Use Agents, CUAs)普遍采用单串行代理架构在处理复杂、长时程任务时存在的效率瓶颈问题。此类架构难以有效实现任务分解、并行执行以及基于新信息的持续重规划,导致在面对需要多步骤协调与动态调整的任务时表现受限。其解决方案的关键在于提出一种多智能体计算机使用系统(Multi-Agent Computer Use, MACU),通过引入一个管理模型(manager model)将复杂任务以有向无环图(DAG)形式进行结构化分解,显式编码子任务间的依赖关系与目标。在每轮迭代中,管理器并行调度多个子代理(subagents)执行DAG中就绪节点,并根据子代理反馈实时更新DAG结构(如增删或重构节点),从而动态适应环境变化。该设计将计算机操作环境中部分可观测性这一核心挑战作为首要考虑因素:关键信息由管理器持久保存并传递至后续任务,避免因下游代理无法重新观测而造成信息丢失。实验表明,MACU在桌面操作(OSWorld)和网页导航(Online-Mind2Web、WebTailBench、Odysseys)等基准上相较强基线模型性能提升3.4%–25.5%,展现出更优的测试时缩放特性,并成功完成单代理系统卡死的复杂长时程任务;尤其在Odysseys基准上,平均任务完成时间缩短约1.5倍,验证了其显著加速传统代理流水线的能力。研究结果表明,多智能体协同是推动计算机使用代理向更长时间、更高效率方向演进的重要路径。
链接: https://arxiv.org/abs/2606.01533
作者: Jing Yu Koh,Ruslan Salakhutdinov,Daniel Fried
机构: 未知
类目: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes computer use tasks as a directed acyclic graph (DAG), encoding relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first class challenge: information that downstream agents may not be able to re-observe are retained and passed forward through the manager and DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by 3.4-25.5% on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by \sim 1.5 \times , demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at this https URL.
[MA-17] LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
【速读】:该论文旨在解决多智能体大语言模型(Multi-Agent LLM)在软件架构设计任务中协作拓扑结构对设计质量影响的系统性问题,核心在于识别最优的协作模式以提升生成方案的完整性、一致性和创新性。其解决方案的关键在于通过一个2×2×2因子实验设计(权威性 × 角色分配 × 动态机制),全面评估12种不同的多智能体协作拓扑,并基于三名独立自动化评估者(GPT-OSS 120B、Claude Opus 4.6、Claude Sonnet 4.6)组成的加权集成评估体系进行量化分析。研究发现,结构化对抗式协作(v4b)表现最佳,其通过提示工程强制要求重构而非局部修补,显著提升了设计质量;跨模型评审(即由一个模型生成、另一个模型评审)在所有评估者中均位列第二,凸显了模型间互补性的价值;同时,评估者间的分歧揭示了不同模型家族对设计质量维度的权重差异,强调了评估多样性的重要性;而并行合并策略因出现“令牌饥饿”和“弗兰肯斯坦效应”导致性能严重下降,表明该模式在复杂任务中存在根本性缺陷。最终,通过加权集成(2×Opus + 2×Sonnet + 1×GPT-OSS)实现的稳健排名在520次实验中得到交叉验证,确立了可复现的协作优化范式。
链接: https://arxiv.org/abs/2606.01490
作者: Nagarjuna Kanamarlapudi,Praveen K
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 12 pages, 9 figures, 5 tables
Abstract:We present a controlled experiment evaluating 12 multi-agent LLM collaboration topologies for software architecture design. Using a 2\times2\times2 factorial design (Authority \times Roles \times Dynamics), we conducted 520 experimental runs across 8 design tasks of varying complexity, with 5 repetitions each. Designs were evaluated on a 12-dimensional rubric by three independent automated evaluators (GPT-OSS 120B, Claude Opus 4.6, Claude Sonnet 4.6). We report four core findings. First, structural adversarial (v4b) ranks #1 by ensemble – a prompt-engineered adversarial variant that demands rewrite mandates rather than patches (weighted ensemble: 4.637/5.0). Second, cross-model review wins unanimously at #2 – generate with one model, review with another – ranking #2 by all three evaluators (weighted ensemble: 4.606). Third, evaluator diversity is itself a finding – all three evaluators agree v4b is best and v3 is worst, but disagree sharply on v2b (Claude d=1.44 vs. GPT-OSS d=0.45), revealing how different model families weight design qualities. Fourth, parallel merge is fundamentally broken – all three evaluators place merge variants in the bottom tier (3.65-3.79), due to token starvation and the Frankenstein effect. The weighted ensemble ( 2\times Opus + 2\times Sonnet + 1\times GPT-OSS) provides robust rankings across 520 runs, confirmed through independent cross-validation.
[MA-18] Crazyflow: An Accurate GPU-Accelerated Differentiable Drone Simulator in JAX
【速读】:该论文旨在解决当前空中机器人(aerial robotics)算法开发中缺乏统一、高效且可扩展的仿真平台问题,尤其在高保真度、可微分性、群体协同与大规模并行计算等方面存在显著瓶颈。现有仿真器虽在特定方向(如高精度建模或单机性能)取得进展,但难以兼顾多维度需求,限制了从基于模型到数据驱动、从梯度优化到采样方法、从单体到多智能体系统等多样化算法的快速迭代与验证。其解决方案的关键在于提出Crazyflow——一个突破性的空中机器人仿真框架,通过创新的轻量化系统辨识管道与高度优化的并行计算架构,实现了前所未有的仿真速度:单机仿真效率超过现有最优方案一个数量级以上,支持每秒超5亿步的采样式避障计算,并可同时模拟数以千计的无人机集群(每群4000架)。该平台不仅具备分析梯度能力,支持无领域随机化下的亚厘米级轨迹跟踪精度,更突破传统“训练-部署”范式,首次实现飞行中实时强化学习,实证展示了在0.38秒内从零开始训练物理无人机恢复策略并成功稳定飞行的能力。此外,Crazyflow兼容所有开源Crazyflie模型,支持快速定制化平台重构,为生成高质量大规模合成数据提供了开放、可扩展的基础设施,推动了在线、执行中学习与优化的新一代算法发展。
链接: https://arxiv.org/abs/2606.01478
作者: Martin Schuck,Marcel P. Rath,Yufei Hua,AbhisheK Goudar,SiQi Zhou,Angela P. Schoellig
机构: Technical University of Munich(慕尼黑工业大学); University of Toronto(多伦多大学); Simon Fraser University(西蒙菲莎大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注:
Abstract:High-quality, large-scale synthetic data from simulations is becoming a cornerstone for pushing the capabilities of robot algorithms. While aerial robotics simulators have evolved to support specialized needs such as fidelity, differentiability, and swarms independently, a unified platform that can synthesize data across all these domains is missing. In this work, we propose Crazyflow, a simulator designed to push the limits of aerial-robotics algorithm development, from model-based to data-driven methods, gradient-based to sampling-based approaches, and single-agent to multi-agent systems. Compared to existing state-of-the-art drone simulators, it achieves speeds more than an order of magnitude faster for a single drone and can simulate thousands of swarms of 4000 drones each. Real-world experiments show Crazyflow supports both analytical-gradient-based policy learning, achieving sub-centimeter trajectory tracking accuracy without domain randomization, and sampling-based obstacle avoidance at speeds exceeding half a billion steps per second. Breaking the traditional train-then-deploy paradigm, we show that its unprecedented speed even enables in-flight reinforcement learning; we demonstrate this by throwing a physical drone into the air and training a recovery policy from scratch in 0.38 seconds, successfully stabilizing the drone. Crazyflow supports multiple levels of simulation abstraction, is directly compatible with all open-source Crazyflie models, and enables rapid reconfiguration across custom drone platforms and applications by providing a light-weight system identification pipeline. By pushing accuracy, speed, and differentiability simultaneously, Crazyflow serves as an open-source resource for synthetic data generation, with emerging capabilities for large-scale parallelization for online, in-execution learning and optimization, opening the door to novel algorithm development.
[MA-19] Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models
【速读】:该论文旨在解决癌症治疗药物研发中因肿瘤异质性及跨癌种分子靶点不明确所带来的挑战,尤其针对现有生成式AI模型在个性化药物发现中缺乏对药物敏感性、可合成性以及作用机制合理性等多维度目标的联合优化问题。其解决方案的关键在于提出一种基于预训练基因型到药物扩散模型的潜在空间优化方法,通过在分子潜在空间引入可学习的扰动,并利用梯度上升算法优化复合奖励函数,该奖励函数综合了预测药物敏感性(AUC)、类药性(QED)与合成可及性(SAS)。为确保生物真实性,奖励设计与评估均基于实验获得的癌细胞系数据和经验证的药理信号,使候选分子生成过程锚定于真实临床证据。此外,通过基于扩散模型注意力机制构建的多智能体大语言模型(LLM)流水线,进一步评估候选化合物的作用机制一致性与可解释性。在三个独立测试集共15个癌细胞系上的实验表明,该方法在药物敏感性、类药性、可合成性及化学有效性方面均显著优于现有基线方法。
链接: https://arxiv.org/abs/2606.01461
作者: Brenda Nogueira,Gisela A. Gonzalez-Montiel,Nitesh V. Chawla,Nuno Moniz
机构: University of Notre Dame(圣母大学)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Developing effective anticancer therapeutics remains challenging due to tumor heterogeneity and the absence of well-defined molecular targets across cancer subtypes. Generative models conditioned on cancer genotypes offer a promising avenue for personalized drug discovery, yet existing approaches lack explicit optimization for simultaneous sensitivity, synthesizability, and mechanistic binding plausibility. We present a latent-space optimization approach for a pretrained genotype-to-drug diffusion model, introducing a learnable perturbation over the molecular latent space optimized via gradient ascent to maximize a composite reward combining predicted drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS). Critically, biological realism is enforced by grounding both reward design and evaluation in experimentally-derived cancer cell line data and validated pharmacologic signals, anchoring candidate generation in real-world clinical evidence. Mechanistic consistency plausibility is further assessed by a multi-agent LLM pipeline grounded in the diffusion model’s attention mechanism. Experiments across 15 cancer cell lines from three held-out evaluation sets demonstrate consistent and noticeable improvements over competing baselines in sensitivity, drug-likeness, synthesizability, and chemical validity.
[MA-20] SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在执行长时交互任务时,依赖外部可复用技能所面临的技能适应性不足问题。现有无训练技能适配方法通常基于完整轨迹或会话级反馈进行更新,导致失败归因粗粒度,易引发不稳定的或过度泛化的修改。其解决方案的关键在于提出一种无需训练的、基于步骤级的显式失败归因机制——SkillAdaptor。该框架能够在给定失败轨迹后,精准识别首个可操作的故障步骤,将责任关联至候选技能,并在显式接受性检查下实施针对性更新,同时保持模型主干冻结。实验在WebShop、PinchBench和Claw-Eval三个基准上使用Kimi-K2.5、GLM-5和GPT-5.2进行验证,结果表明SkillAdaptor在所有评测集上均显著优于无技能与基线技能适配方法,单指标最大提升达+1.5(PinchBench平均得分%)、+1.8(Claw-Eval平均得分)和+1.7(WebShop成功率),证明了步骤级归因能够实现更稳定且可审计的无训练技能维护。
链接: https://arxiv.org/abs/2606.01311
作者: Zhuoyun Yu,Xin Xie,Wuguannan Yao,Chenxi Wang,Lei Liang,Xiang Qi,Shumin Deng
机构: Zhejiang University (浙江大学); Ant Digital Technologies, Ant Group (蚂蚁数字科技,蚂蚁集团); Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph (浙江大学-蚂蚁集团知识图谱联合实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Work in progress
Abstract:Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance\footnoteThe code will be released at this https URL…
[MA-21] Coordinating Task Switching in a Robotics Multi-Agent System Using Behavior Trees
【速读】:该论文旨在解决多机器人系统在动态复杂环境下的协同行为协调问题,特别是在微型足球机器人竞赛(IEEE Very Small Soccer, VSSS)场景中,如何实现三台机器人组成的团队在高强度对抗性任务中的高效协作。其核心挑战在于应对快速变化的比赛局势,确保各机器人在进攻、防守与位置调整等行为之间实现灵活、鲁棒且可扩展的协同决策。解决方案的关键在于提出一种基于行为树(Behavior Tree, BT)的多机器人协同框架,相较于以往采用有限状态机(Finite State Machine, FSM)的策略,该方法通过层次化结构实现了行为模块的可组合性与可维护性,显著提升了系统的灵活性和可调试性。实验结果表明,该行为树方法在仿真平台FIRASim中的表现优于传统FSM方案,并在实际学术机器人竞赛中验证了其有效性与实用性。
链接: https://arxiv.org/abs/2606.01170
作者: Lucas Haug,Anarosa Alves Franco Brandão,Arthur Casals
机构: LTI - Laboratório de Técnicas Inteligentes, Universidade de São Paulo (圣保罗大学)
类目: Multiagent Systems (cs.MA); Robotics (cs.RO)
备注: 7 pages, 7 figures. Preprint of a manuscript submitted to the XXVI Congresso Brasileiro de Automática (CBA 2026)
Abstract:The application of multi-agent systems in robotics is a very challenging field. Several competitions involving such systems are proposed to foster research and development of strategies and mechanisms using games as the underlying domain. Among them are the ones from the \textitIEEE Very Small Soccer (VSSS) category, which is the case study described in this paper. In VSSS, two teams of three robots each compete in a very dynamic environment of a soccer game. Thus, coordination of robots’ behavior during the game is crucial to win it. In this paper, we present a Behavior-Tree-based approach to support multi-robot coordination within the VSSS team of the ThundeRatz robotics team from the Universidade de S \tildea o Paulo. Moreover, a comparison between the proposed approach and the previous one, which was based on a Finite State Machine (FSM), was conducted using the FIRASim simulator. Besides that, the performance of this new strategy was further evaluated in an academic robotics competition.
[MA-22] When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding
【速读】:该论文旨在解决多智能体大语言模型(Multi-agent Large Language Model, LLM)系统在复杂任务(如软件编码)中因引入智能体而产生的跨智能体通信开销问题,该开销会抵消并行化带来的效率优势。其核心挑战在于如何在任务分解以缩短关键路径计算时间与减少跨智能体依赖导致的上下文传递成本之间实现平衡。解决方案的关键在于将多智能体编排形式化为图划分问题,通过静态分析构建项目级依赖图,识别结构枢纽文件,利用社区检测算法进行图分区,并采用依赖感知调度器执行任务。该方法——即协聚感知编码器(Cohesion-aware Coder, Co-Coder)——在DevEval和CodeProjectEval上的28个真实世界任务中展现出显著性能提升:相比串行和基于文件的并行基线以及Claude Code with Agent Teams,最高提升通过率14.0%,实现高达2.10倍的墙钟速度提升,并降低35%的API调用成本,尤其在依赖密集型项目中表现最优。该研究证明了基于内聚性感知的编排策略可使并行编码智能体兼具理论合理性与实践高效性,为多智能体系统设计提供了普适性的架构范式。
链接: https://arxiv.org/abs/2606.00953
作者: Xu Yang,Lunyiu Nie,Ethan Chandra,Stanislav Gannutin,Fangru Lin,Swarat Chaudhuri
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); University of Oxford (牛津大学)
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto-frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems. Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA) Cite as: arXiv:2606.00953 [cs.LG] (or arXiv:2606.00953v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00953 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-23] FinCom: A Financial Multi-Agent Demo with Disagree-or-Commit Deliberation
【速读】:该论文旨在解决当前基于大语言模型(LLM)的多智能体系统在金融分析与决策支持中因过度追求共识或辩论而引发的“奉承倾向”(sycophancy)问题,即智能体倾向于迎合同伴观点而非依据证据进行独立判断,导致过早达成一致并降低决策质量。其解决方案的关键在于提出FinCom(Financial Committee)框架,该框架通过实施“不赞同则承诺”(Disagree-or-Commit, DoC)协议,将结构性异议作为治理核心机制嵌入金融智能体委员会。系统由一个中央监督者(Supervisor)协调三个具备ReAct能力的专业智能体——研究、量化与风险分析,各智能体配备专用工具以执行信息检索、计算建模和压力测试。在协商过程中,智能体必须明确批判或正式承诺同行推理后方可达成统一建议,从而强制引入理性分歧。实验表明,在自研及外部金融任务评估中,采用DoC协议的模型在推理准确率与风险意识方面显著优于传统共识导向基线。该方法将分歧重构为可治理的元机制,提供了一种仅依赖提示词的轻量级方案,有效提升了金融领域生成式智能体系统的问责性、透明度与认知鲁棒性。
链接: https://arxiv.org/abs/2606.00939
作者: Chao Peter Yang,Zixiao Tan,Kaisen Yao,Ziyu Zhou,Eleanor Jiang,Michael Wu
机构: Duke University(杜克大学); ClearPath(清晰路径)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent systems powered by large language models (LLMs) are increasingly used for financial analysis and decision support. However, existing coordination schemes, especially those emphasizing consensus or debate, are vulnerable to sycophancy: agents conform to peer reasoning instead of evidence, leading to premature agreement and degraded outcomes. We introduce FinCom (Financial Committee), a governed multi-agent framework and interactive system that operationalizes the Disagree-or-Commit (DoC) protocol to embed structured dissent into financial AI committees. A central Supervisor orchestrates three ReAct-enabled specialist agents: Research, Quantitative, and Risk. Each agent is equipped with role-specific tools for retrieval, computation, and stress testing. During deliberation, agents must either explicitly critique or commit to their peers’ reasoning before converging on a unified recommendation. This demonstration showcases how FinCom supports committee-style financial analysis through coordinated multi-agent interaction, including structured report generation and interactive decision support. Evaluated across the most recent financial agent benchmark, in addition to 90 internal handcrafted financial tasks using an LLM-as-a-Judge protocol, DoC improves reasoning accuracy and risk awareness significantly over a consensus-seeking baseline on both an in-house and external evaluation set. By reframing disagreement as a governance primitive rather than noise, FinCom offers a lightweight, prompt-only recipe for improving accountability, transparency, and epistemic robustness in agentic financial systems. Subjects: Multiagent Systems (cs.MA) Cite as: arXiv:2606.00939 [cs.MA] (or arXiv:2606.00939v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2606.00939 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-24] SuperMemory-VQA: An Egocentric Visual Question-Answering Benchmark for Long-Horizon Memory
【速读】:该论文旨在解决当前生成式AI在长期自我中心视频(egocentric video)中难以满足人类真实记忆需求的问题,尤其针对日常生活中因时间跨度长而产生的记忆断层(memory gaps),如物品位置、事件顺序、对话内容及意图回忆等。现有自中心数据集多聚焦于短时动作识别或通用问答任务,无法有效评估系统在纵向时间维度上的记忆能力。为此,论文提出SuperMemory-VQA数据集,涵盖52.9小时的真实生活场景视频,同步采集RGB视频、语音转录、眼动追踪、惯性测量单元(IMU)及SLAM轨迹等多模态信息,并通过人工验证构建了4,853个具有地面实况的问答对,覆盖对象与位置记忆、意图回溯、视觉场景重建、时间线还原、对话记忆以及上下文检索等多种实际记忆任务。关键创新在于引入多选题形式并设置“不可回答”选项,以评估模型在缺乏充分证据时的幻觉鲁棒性。基准测试表明,当前主流代理框架与大语言模型(LLM)在真实世界长时记忆任务上仍表现不佳,凸显了构建具备证据依赖机制的具身化AI记忆架构的重要性。用户调研进一步验证了该数据集问题的真实性与实用性,符合日常记忆需求。
链接: https://arxiv.org/abs/2606.00825
作者: Samiul Alam,Shakhrul Iman Siam,Michael J. Proulx,James Fort,Richard Newcombe,Hyo Jin Kim,Mi Zhang
机构: The Ohio State University (俄亥俄州立大学); Meta
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA)
备注: 34 pages, 21 figures, 5 tables
Abstract:AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct grounded 4,853 question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit “unanswerable” option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
[MA-25] Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems
【速读】:该论文旨在解决企业多智能体系统(Multi-agent Systems, MAS)在实际部署中缺乏明确依据以确定何时采用共识(consensus)、辩论(debate)、综合(synthesis)或简化单智能体工作流等协调策略的问题。其核心挑战在于现有实践往往采用全局固定的协调模式,而未根据具体任务类型动态调整。论文提出的关键解决方案是:基于问题类别进行动态协调策略选择,而非预先固定策略。通过在涵盖六个行业、五类问题、四种执行条件的30项企业任务上进行大规模实验(共1,440次生成输出),并使用统一评分标准评估结果,研究发现尽管原始假设(即存在严格意义上的最优策略)未获支持,但“近似最优路由”(near-best routing)这一弱化命题得到强有力验证——在所有预注册模型臂及问题类别中,预测的最佳策略始终与实际表现最优条件相差不超过0.10个质量得分点;仅在结构合规性验证场景中出现例外,各模型均偏好单智能体模式而非共识机制。此外,越南语域与英语域任务在协调策略排序一致性上无显著差异(Kendall’s W均值为0.20,p = .85),表明语言域影响不显著。最终结论指出,企业级协调政策应将动态路由作为校准后的默认机制,而非确定性的最优策略选择法则。
链接: https://arxiv.org/abs/2606.00804
作者: Thanh Luong Tuan
机构: Golden Gate University(金门大学); Foundation AgenticOS (FAOS)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 4 appendix
Abstract:Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall’s W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law. Comments: 13 pages, 4 appendix Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2606.00804 [cs.MA] (or arXiv:2606.00804v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2606.00804 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[MA-26] Scaling Behavior of Single LLM -Driven Multi-Agent Systems
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的多智能体系统(Multi-Agent System, MAS)在智能体数量增加时的可扩展性规律与内在协同动力学机制不明确的问题。现有研究普遍假设更多智能体必然带来更强性能,但缺乏对协作增益与协调开销之间权衡关系的系统性分析。其解决方案的关键在于提出一种极简的顺序迭代多智能体系统(Sequential Iterative Multi-Agent System, SIMAS)框架,通过严格控制变量、仅保留序列化智能体间通信机制,从而清晰观测智能体数量对系统性能的影响。实验结果表明,MAS性能并非随智能体数量单调上升,而是呈现边际收益递减趋势,其根本原因在于协作协同效应与协调开销之间的动态平衡。研究进一步揭示:高效多智能体系统依赖于具备足够能力的基础模型,最优智能体数量受任务类型显著调节,且集体智能是策略性交互设计所催生的涌现属性,而非单纯由智能体数量决定。此外,性能下降主要源于协调开销而非长上下文处理失败,并且该缩放规律在结构化辩论拓扑等不同交互架构中具有泛化性。本工作为多智能体系统的可扩展性提供了基础理论框架,为构建高效协同系统提供了实践指导,并挑战了“智能体越多越好”的主流认知。
链接: https://arxiv.org/abs/2606.00655
作者: Jialing Li,Zhouhong Gu,Yin Cai,Hongwei Feng
机构: Fudan University (复旦大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The burgeoning field of LLM-based Multi-Agent Systems (MAS) promises to tackle complex tasks through collaborative intelligence, yet fundamental questions regarding their scaling behavior and intrinsic collective dynamics remain underexplored. This paper systematically investigates how the performance of a homogeneous MAS evolves as the number of agents increases, isolating the variable of collaboration from model or knowledge heterogeneity. We propose the Sequential Iterative Multi-Agent System (SIMAS) framework, a minimalist architecture centered on sequential inter-agent communication, to clearly observe scaling effects. Through extensive experiments across diverse tasks and model scales, we establish that MAS performance does not scale monotonically with agent count but follows a pattern of diminishing returns, governed by a trade-off between collaborative synergy and coordination overhead. Our findings reveal that effective MAS requires a sufficiently capable base LLM, that task type critically modulates the optimal agent count, and that collective intelligence is an emergent property contingent on strategic interaction design rather than a guaranteed outcome of agent plurality. The performance degradation stems coordination overhead rather than merely long-context failure, and the scaling tendency generalizes across interaction architectures like structured debate topologies. This work provides a foundational understanding of MAS scaling laws, offering practical guidance for designing efficient collaborative systems and challenging the prevailing assumption that more agents invariably lead to better performance.
[MA-27] MemGraphRAG : Memory-based Multi-Agent System for Graph Retrieval-Augmented Generation KDD2026
【速读】:该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)在处理大规模、非结构化语料时因信息高度碎片化而导致的检索不完整与逻辑不一致问题,尤其针对现有图结构RAG(GraphRAG)方法依赖局部片段级知识提取而缺乏全局语料视角所引发的主题不一致、逻辑冲突及结构断裂等缺陷。其解决方案的关键在于提出MemGraphRAG框架,通过引入基于共享记忆的多智能体协同系统,构建具备统一全局上下文感知能力的知识图谱。该机制使智能体能够在知识抽取过程中动态识别并解决逻辑矛盾,维持跨文档的结构连贯性;同时,设计了一种面向构建图谱的内存感知分层检索算法,显著提升复杂推理场景下的检索质量。实验结果表明,该方法在多个基准测试中优于当前最优模型,且保持了相近的计算效率。
链接: https://arxiv.org/abs/2606.00610
作者: Chuanjie Wu,Zhishang Xiang,Yunbo Tang,Zerui Chen,Qinggang Zhang,Jinsong Su
机构: Xiamen University(厦门大学); Jilin University(吉林大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: Accepted by KDD 2026
Abstract:Retrieval-Augmented Generation (RAG) has become an essential method for mitigating hallucinations in Large Language Models (LLMs) by leveraging external knowledge. Although effective for simple queries, traditional RAG struggles with large-scale, unstructured corpora where information is highly fragmented. Graph-based RAG (GraphRAG) incorporates knowledge graphs to capture structural relationships, enabling more comprehensive retrieval for complex reasoning. However, existing GraphRAG methods rely on isolated, fragment-level extraction for graph construction, lacking a global perspective on the whole corpus. As a result, these methods frequently lead to thematically inconsistent, logically conflicting, and structurally fragmented graphs that degrade retrieval performance. In this paper, we propose MemGraphRAG, a novel framework that introduces a memory-based multi-agent system to ensure high-quality graph construction. Specifically, MemGraphRAG employs a collaborative society of agents supported by shared memory, which provides a unified global context throughout the extraction process. This mechanism allows agents to dynamically resolve logical conflicts and maintain structural connectivity throughout the corpus. Furthermore, we propose a memory-aware hierarchical retrieval algorithm tailored for the constructed graph. Extensive experiments on multiple benchmarks demonstrate that MemGraphRAG outperforms the state-of-the-art baseline models with comparable efficiency. Our code is available at this https URL.
[MA-28] State Machine Guided Multi-Relational Synthetic Data from Logs for Anomaly Detection
【速读】:该论文旨在解决现有日志异常检测方法将日志视为扁平的模板序列,忽视了事件之间随时间演进所遵循的隐含执行状态结构的问题。其核心解决方案是通过从日志中直接恢复一个执行状态机(execution state machine),并据此构建连接轨迹、事件、状态、转移及参数的多表关系模式(multi-table relational schema)。该发现的状态机作为生成先验,用于生成保持结构、时序与流程约束的真实感多关系合成数据,同时增强罕见但合法的执行行为。实验表明,基于该框架生成的数据在约束验证、分布相似性及过程级指标上具有高保真度,并显著提升了在独立真实数据集上的异常与缺陷检测性能,优于基于序列的基线方法和简单的过采样策略。研究揭示了执行日志隐含编码了一个由潜在状态机驱动的关系型数据库,而恢复这一结构可实现可解释且鲁棒的合成数据生成,从而推动更精准的异常检测。
链接: https://arxiv.org/abs/2606.00531
作者: Aja Khanal,Apurva Narayan
机构: University of Western Ontario(Western Ontario大学); London(伦敦); Canada(加拿大)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Software systems generate massive unstructured logs that record execution behavior, failures, and interactions across components, yet existing log anomaly detection methods treat these logs primarily as flat sequences of templates, overlooking the relational execution structure that governs how events co-occur and evolve over time. We propose a framework that discovers this hidden structure by recovering an execution state machine directly from logs and inducing a corresponding multi-table relational schema connecting traces, events, states, transitions, and parameters. This discovered state machine serves as a generative prior to produce realistic multi-relational synthetic data that preserves structural, temporal, and process constraints while amplifying rare but valid execution behaviors. We assess the fidelity of the generated data through constraint validation, distributional similarity, and process-level metrics, and demonstrate its usefulness by showing that augmenting real logs with the synthetic relational data significantly improves anomaly and bug detection on held-out real datasets compared to sequence-based baselines and naive oversampling. Our results show that execution logs implicitly encode a relational database governed by a latent state machine, and that recovering this structure enables principled synthetic data generation for robust and interpretable anomaly detection.
[MA-29] Leverag ing the Learning Curve: Reusing Existing Architectural Patterns to Design and Implement MAS
【速读】:该论文旨在解决当前多智能体系统(Multi-Agent Systems, MAS)在软件工程实践中普遍忽视其固有的分布式与协作特性的问题,尤其针对现有专用系统多作为其他人工智能系统组件使用、缺乏统一架构指导的现象。其核心挑战在于如何将分布式系统(Distributed Systems, DS)工程中的成熟方法与已建立的智能体理论相结合,以提升现代MAS的可工程化水平。解决方案的关键在于提出一个最小化的智能体相关概念集,并将其嵌入分布式系统领域,从而实现两大目标:一是通过在分布式系统架构模式中引入这些智能体概念,设计出可扩展的分布式多智能体系统;二是在无智能体理论基础的学生群体中开展教学实践,验证该方法在降低学习门槛、提升工程实现能力方面的有效性。实验结果表明,即便超过三分之二的学生缺乏分布式系统开发经验,其平均课程成绩仍高于80%,证明了该方法在教学与实际应用中的可行性。该研究为融合现代生成式AI技术与传统智能体理论提供了统一的工程框架,支持在保持学术严谨性的同时,高效利用成熟的分布式系统技术进行先进系统的构建。
链接: https://arxiv.org/abs/2606.00287
作者: Arthur Casals,Anarosa A. F. Brandão
机构: Escola Politécnica da USP (USP工学院); São Paulo, Brazil (圣保罗, 巴西)
类目: Multiagent Systems (cs.MA); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Author’s accepted manuscript of an article published in IEEE Access. 17 pages, 6 figures. IEEE Access, vol. 13, pp. 45809-45825, 2025. Copyright 2025 IEEE. Personal use of this material is permitted. The final version is available at this https URL
Abstract:Recent advancements in AI have led to the development of specialized systems related to multi-agent systems (MAS). However, the inherently collaborative nature of agents is often overlooked, and many of these specialized systems are used as components by other AI systems. From a software engineering perspective, this context can benefit from aligning the architectural characteristics of distributed systems with the inherently distributed nature of MAS. We propose that introducing a minimal set of agent-related concepts into the Distributed Systems (DS) domain can improve the engineering of modern MAS by leveraging techniques from DS engineering with established agent theory. In this study, we recapitulated the common origins of MAS and DS by drawing architectural parallels to establish a unified engineering approach. We then defined a minimal set of agent concepts to perform two practical studies on leveraging MAS development. First, we incorporated these concepts into a DS architectural pattern to design a distributed MAS. We then used these concepts in a graduate course to teach MAS engineering to students with no prior knowledge of agent theory. The learning outcomes from both courses included successful MAS implementation using DS tools and techniques. Although more than two-thirds of these students had no practical experience in developing distributed systems, the average final grade in both courses was above 80%, thus validating our approach. Finally, we discuss how this study supports the development of advanced systems using modern AI techniques consistently with established agent-related research while leveraging established DS techniques and concepts.
[MA-30] MindZero: Learning Online Mental Reasoning With Zero Annotations ICML2026
【速读】:该论文旨在解决现实世界中智能体实现有效辅助所面临的三大核心挑战:(1)在多假设条件下实现鲁棒的在线心理状态推断与不确定性更新;(2)支持实时辅助所需的高效推理能力;(3)真实场景中缺乏心理状态的真值标注。其解决方案的关键在于提出MindZero——一种基于自监督强化学习的框架,通过训练多模态大语言模型(MLLMs)实现高效且鲁棒的在线心理推理。该方法在训练阶段采用基于规划器估计观测行为似然性的奖励机制,促使模型生成高可能性的心理状态假设,从而模拟基于模型的理论心智(ToM)推理过程,无需依赖显式的心理状态标注。训练完成后,MindZero将基于模型的推理机制内化为高效的单次前向传播推理,显著提升推理速度与准确性。实验结果表明,仅使用大语言模型(LLM)难以胜任复杂心理推理任务,而传统基于模型的方法虽准确但效率低下且受限于基础模型容量;相比之下,MindZero不仅显著增强了MLLMs的内在理论心智能力,还在准确率和效率上全面超越现有方法,证明了心理推理可作为一项自监督技能被有效学习。
链接: https://arxiv.org/abs/2606.00240
作者: Shunchi Zhang,Jin Lu,Chuanyang Jin,Yichao Zhou,Zhining Zhang,Tianmin Shu
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: ICML 2026. Website: this https URL
Abstract:Effective real-world assistance requires AI agents with robust Theory of Mind (ToM): inferring human mental states from their behavior. Despite recent advances, several key challenges remain, including (1) online inference with robust uncertainty updates over multiple hypotheses; (2) efficient reasoning suitable for real-time assistance; and (3) the lack of ground-truth mental state annotations in real-world domains. We address these challenges by introducing MindZero, a self-supervised reinforcement learning framework that trains multimodal large language models (MLLMs) for efficient and robust online mental reasoning. During training, the model is rewarded for generating mental state hypotheses that maximize the likelihood of observed actions estimated by a planner, similar to model-based ToM reasoning. This method thus eliminates the need for explicit mental state annotations. After training, MindZero internalizes model-based reasoning into fast single-pass inference. We evaluate MindZero against baselines across challenging mental reasoning and AI assistance tasks in gridworld and household domains. We found that LLMs alone are insufficient; model-based methods improve accuracy but are slow, costly, and limited by backbone MLLM capacity. In contrast, MindZero enhances MLLMs’ intrinsic ToM ability and significantly outperforms model-based methods in both accuracy and efficiency, showing that mental reasoning can be effectively learned as a self-supervised skill.
[MA-31] When Agents Talk: Discourse Manipulation and Risk in an Agent ic Social Network
【速读】:该论文旨在解决生成式 AI(Generative AI)代理在共享在线环境中大规模交互所引发的新型操作安全风险问题。其核心挑战在于,尽管多数内容为良性,但存在显著比例的有毒、操纵性或恶意内容,且这些有害行为常隐匿于常规功能讨论之中,难以通过传统监控手段识别。解决方案的关键在于构建多层级分析框架:首先利用语义聚类识别高互动性内容主题,再结合大语言模型(LLM)辅助分类与人工审阅相结合的方式,对高风险样本进行精细化标注;最终识别出74类具体恶意行为模式,涵盖凭证窃取、远程执行指令、代理路由引导及未经信任的技能安装等,并揭示了可于数分钟内生成数千条内容的协同发布攻击行为,从而实现对复杂、隐蔽的AI代理间恶意活动的有效检测与归因。
链接: https://arxiv.org/abs/2606.00067
作者: 10a Labs:Grace Cheong,Violet Davis,Juliette Garcia,Kendal Gee,Molly Hart,Nicholas Hayes,Henry Houghton,Kyle Lee,Paige Lee,Vicky Lee,Hailey May,Bobby McKenzie,Christine McNeill,Han Nguyen,Brooke Perreault,David Pham,Charlie Plumb,Olivia Quill,Matthew Swain,Grace Wang,Adam Warren,Corie Wieland,Zachary Yahn
机构: 10a Labs(10a实验室)
类目: ocial and Information Networks (cs.SI); Multiagent Systems (cs.MA)
备注:
Abstract:AI agents are increasingly interacting within shared online environments, creating new operational security risks. We analyze activity on Moltbook, a Reddit-style social platform where AI agents–typically configured and overseen by human operators–post and interact with one another at scale. Using a dataset of 228,684 posts produced by more than 39,500 accounts over a seventeen-day observation window, we combine semantic clustering of high-engagement posts with LLM-assisted classification of harmful content and manual review of high-risk samples. The analysis identifies 98 thematic discourse clusters spanning agent infrastructure, autonomy debates, and financial activity. While most observed content was benign, 18.28% of posts contained toxic, manipulative, or malicious material. We cluster malicious content and identify 74 classes of malicious behavior, including credential harvesting attempts, host-execution instructions, proxy routing guidance, and efforts to install untrusted agent skills. Harmful content frequently appeared within mainstream operational discussions about agent functionality. We also document coordinated posting campaigns capable of generating thousands of posts in minutes.
[MA-32] Fake Plastic Voters: When Political Parties Can Use AI-Simulated Focus Groups
【速读】:该论文旨在解决政治竞选研究中如何有效运用生成式AI增强型模拟技术(AESTs)以替代传统焦点小组的问题。其核心挑战在于:在保证研究效度的前提下,明确AESTs适用的边界与情境,避免因过度依赖技术而削弱对政治话语与身份建构等复杂社会互动过程的理解。解决方案的关键在于提出一个三维度决策矩阵,整合战略目的、部署风险及模拟工具的经验基础,从而指导策略制定者合理选择研究方法。其中,战略目的为决定性维度——若研究目标是观察政治意义与身份在互动中自然涌现(模式1),则无论部署风险高低,均不可由AESTs替代真实人类互动;若目标为测试和优化竞选信息(模式2),则需结合部署风险与工具经验基础进行权衡。研究强调,即便在模式2中,对AESTs的常规依赖也可能侵蚀基于深度质性洞察的判断力,警示应谨慎对待技术替代的边界。
链接: https://arxiv.org/abs/2606.00043
作者: Claudio Novelli,Javier Argota Sanchez-Vaquerizo,Jennifer Cyr,Giuliano Formisano,Simon McDougall,Giulia Sandri,Luciano Floridi
机构: 未知
类目: Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注:
Abstract:Political parties strive to understand their electorates, and focus groups are a vital tool in these efforts. AI-enhanced simulation technologies (AESTs) enable synthetic focus groups in a fraction of the time (and cost), raising the question of when and how such simulated evidence can be used in campaign research. This paper develops a decision matrix to help party strategists match research needs to appropriate simulation technologies and to identify when to escalate to hybrid or fully human focus groups. The matrix combines three dimensions: strategic purpose, deployment risk, and empirical grounding of the simulation tool. Strategic purpose is the decisive dimension, as it determines what kind of evidence the focus group is meant to produce: observing how political meanings and identities emerge through interaction (Mode 1) or testing and refining campaign messages (Mode 2). The matrix shows that, given documented failure modes such as sycophancy, persona drift, and the suppression of minority viewpoints, AESTs cannot replace human interaction in Mode 1 at any risk level. Within Mode 2, suitability depends instead on deployment risk and on the empirical grounding. Yet even here, we caution that routine reliance on AESTs may erode the qualitative craft on which sound judgment depends.
[MA-33] MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution NEURIPS2025
【速读】:该论文旨在解决多智能体战略交互中语言模型代理训练的核心难题:单步动作的质量可能依赖于从未发生过的未来事件、违反游戏规则的行动,或其它智能体所作决策。传统强化学习(Reinforcement Learning, RL)假设每一步均可获得即时奖励,但在时间与智能体间结果高度耦合的场景下,这一假设失效。其解决方案的关键在于提出延迟的逐步奖励归因结合资格门控机制(delayed per-step reward attribution with eligibility gating),通过构建一个完整的训练周期生命周期和后处理流程,在回合结束时统一计算奖励,并依据任务特定语义将奖励回溯至对应动作步骤,同时排除缺乏有效依赖信息的无效步骤参与训练。该方法与vLLM的连续批处理实现异步采样、基于课程的对手采样以及多层次分层批量构建相结合,显著提升了多智能体环境中的训练稳定性与样本效率。在NeurIPS 2025的MindGames Arena基准测试中,仅使用80亿参数的开源模型即在对抗测试中达到甚至超越更大规模专有系统(如GPT-5),并在开放(无限制)与高效(=80亿参数)两个赛道均取得第一名。
链接: https://arxiv.org/abs/2606.00017
作者: Aliaksei Korshuk,Alexander Buyantuev,Ilya Makarov
机构: iMak AI Lab (iMak人工智能实验室); Innopolis University (伊诺波利斯大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
备注: 18 pages, 2 figures, 9 tables. Technical report. First place in both Open and Efficient tracks of MindGames Arena Generalization Track at NeurIPS 2025
Abstract:Training language model agents for multi-agent strategic interaction presents a core difficulty: the quality of any action may depend on future events that never materialize, on moves that violate game rules, or on decisions made by other players. Standard reinforcement learning assumes that rewards can be assigned at each step, but this assumption fails in settings where outcomes are entangled across time and agents. We introduce delayed per-step reward attribution with eligibility gating, an episode lifecycle and postprocessing pipeline that computes rewards only at episode end, propagates them back to originating steps according to task-specific semantics, and excludes steps that lack valid dependent information from training. Together with asynchronous rollout generation via vLLM’s continuous batching, curriculum-based opponent sampling, and multi-level stratified batch construction, this approach enables stable, sample-efficient RL training in multi-agent environments. We evaluate on the MindGames Arena benchmark at NeurIPS 2025, where a single 8-billion-parameter open-source model trained with our method matched or surpassed substantially larger proprietary systems, including GPT-5, in head-to-head play and took first place in both the Open (unrestricted) and Efficient (=8B parameters) tracks.
[MA-34] A No-Regret Framework for Adaptive Incentive Design
【速读】:该论文旨在解决在具有连续动作空间和私有成本信息的非线性博弈中,如何设计激励机制以实现社会最优均衡的问题。核心挑战在于:中央规划者(authority)无法直接观测个体代理者的成本函数,需通过反复观察其策略响应来学习其偏好,同时设计激励措施引导纳什均衡向社会最优行动配置收敛。解决方案的关键在于提出一种无后悔自适应激励设计(No-Regret Adaptive Incentive Design, RAID)框架,其核心是构建一个仅需渐减激励(diminishing excitation)即可保证强一致性的最小二乘估计器。基于此弱激励要求,设计了一种切换激励策略,交替执行探查(探索)与基于估计结果的利用(exploitation)阶段,从而实现了参数估计的 O(t−0.5) 收敛速率,并累积几乎必然的平方社会成本后悔项为 O(t0.5logt)。此外,针对存在内生噪声导致标准最小二乘估计出现误差变量相关偏差的情况,进一步引入重复采样估计器及对应的切换策略,仍保持相同的几乎必然收敛与后悔率。数值实验验证了该方法的有效性及其理论预测的收敛性能。
链接: https://arxiv.org/abs/2606.02529
作者: Georgios Vasileiou,Lantian Zhang,Silun Zhang
机构: KTH Royal Institute of Technology (皇家理工学院)
类目: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
备注: 21 pages, 5 figures
Abstract:Incentive design studies how a central authority can influence strategic agents through payments, subsidies, or taxes, so that individual objectives align with collective welfare. This paper introduces a No-Regret Adaptive Incentive Design (RAID) framework for nonlinear games with continuous action spaces and private agent costs. In this framework, the authority (planner) designs incentives that regulate the Nash equilibrium toward a socially optimal action profile, while simultaneously learning agents’ unknown preferences from repeated strategic responses. We formulate the RAID problem and construct a least-squares estimator whose strong consistency requires only diminishing excitation. Leveraging this weak excitation requirement, we propose a switching incentive policy that alternates between probing (exploration) and estimate-based (exploitation) incentives. The resulting policy achieves an O(t^-0.5) parameter estimation rate and accumulates O(t^0.5\log t) squared social-cost regret, almost surely. We further extend the framework to an endogenous-noise response model, where standard least-squares estimation is biased due to an error-in-variables correlation between the noise and agent responses. We utilize a repeated-sampling estimator and corresponding switching policy that retain the same almost-sure convergence and regret rates. Numerical experiments validate the effectiveness and predicted convergence rates of the method.
[MA-35] Civilizational Metamaterials: Engineering Coordination Under Capability Gradients and Structural Turbulence
【速读】:该论文旨在解决人工智能治理中因决策速度与人类验证能力不匹配所引发的系统性失效问题,核心挑战在于:当生成式 AI(Generative AI)显著提升决策速度时,若其输出结果的验证成本超过行动预期收益,理性主体将陷入“不作为”的稳定但灾难性的纳什均衡——即“冻结均衡”(Freezing Equilibrium)。为实现治理从规范性学科向可量化、可测试的工程化范式转型,论文提出基于超材料(metamaterials)物理机制的仿生框架,构建了机构协调的本构关系模型:$ R_\mathrmeff = \beta \cdot (1-\rho) \cdot (1-\tau) \cdot (1-\gamma \rho \tau) $,其中 $ \beta $ 为决策分支因子,$ \rho $ 为来源真实性(provenance fidelity),$ \tau $ 为验证率,$ \gamma $ 表征来源与验证失败间的相关检测协同效应。该模型揭示了系统在自修复($ R_\mathrmeff > 1 )与自我失稳( R_\mathrmeff < 1 $)之间的突变相变特征。关键解决方案在于引入三类来源分类体系(加密型、制度型、情境绑定型)并推导出四项可证伪假设,进而设计为期12周的阶梯楔形集群随机试验,以在政府资助评审环节实证检验治理架构的有效性,从而打通人工智能对齐理论与制度设计之间的桥梁。
链接: https://arxiv.org/abs/2606.00235
作者: David Orban
机构: 未知
类目: Physics and Society (physics.soc-ph); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Multiagent Systems (cs.MA)
备注: 19 pages, 4 figures. Accepted for presentation at AGI-26 (Springer LNAI, forthcoming). v2 corrects the sign of the synergy term in the constitutive law (Eq. 2) and reformulates H3 as a threshold-crossing claim, per peer review
Abstract:We argue that governance must transition from a normative discipline to an engineering discipline, and we develop a formal framework, inspired by the physics of metamaterials, to make this transition quantitative and testable. Artificial General Intelligence affects civilization primarily by increasing decision velocity while human verification capacity remains bounded. When the cost of validating AI-generated outputs exceeds the expected utility of acting on them, rational agents default to inaction: a stable but catastrophic Nash equilibrium we term the Freezing Equilibrium. Drawing on metamaterials, where emergent macro-properties arise from designed microstructure, we develop a phenomenological constitutive law for institutional coordination: R_\mathrmeff = \beta \cdot (1-\rho) \cdot (1-\tau) \cdot (1-\gamma \rho \tau) , where \beta is the decision branching factor, \rho is provenance fidelity, \tau is the verification rate, and \gamma \in [0,1] captures correlated-detection synergy between provenance and verification failures. The model predicts a sharp phase transition between self-healing ( R_\mathrmeff 1 ) and self-destabilizing ( R_\mathrmeff 1 ) regimes. We introduce a three-class provenance taxonomy: cryptographic, institutional, and context binding, and derive four falsifiable hypotheses with a proposed 12-week stepped-wedge cluster-randomized trial in government grant review panels. The framework bridges AI alignment theory and institutional design.
自然语言处理
[NLP-0] AdaCodec: A Predictive Visual Code for Video MLLM s
【速读】: 该论文旨在解决现有视频多模态大语言模型(video MLLMs)在处理视频时存在的冗余问题:由于相邻帧间具有高度的时间相关性,传统方法将每帧独立编码为RGB图像,导致视觉标记重复表达已有内容,造成计算资源浪费。其核心解决方案是提出一种名为AdaCodec的**预测性视觉码(predictive visual code)**新接口机制,通过动态判断场景可预测性来优化视觉标记分配——仅当基于前序上下文无法准确预测当前帧时,才使用完整视觉标记传输参考帧;否则,仅以紧凑的P-token形式编码帧间变化(包括运动信息与预测残差)。这一设计显著提升了效率,在相同视觉标记预算下,AdaCodec在全部11个基准测试中均优于Qwen3-VL-8B的逐帧RGB基线;即使在仅为基线1/7的标记预算(32k tokens)下,仍能在长视频任务中全面超越224k标记的基线,并将首次生成时间从9.26秒大幅缩短至1.62秒,同时在五项通用视频任务上实现平均性能提升。
链接: https://arxiv.org/abs/2606.02569
作者: Haowen Hou,Zhen Huang,Zheming Liang,Qingyi Si,Chenglin Li,Shuai Dong,Kele Shao,Ruilin Li,Dianyi Wang,Nan Duan,Jiaqi Wang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Innovation Institute (上海创新研究院); JD.com (京东)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages
Abstract:Video is temporally redundant: adjacent frames usually share most objects, background, and layout. Yet existing video multimodal large language models (video MLLMs) usually encode each sampled frame as an independent RGB image, causing visual tokens to repeat content already present in earlier frames. This suggests a more direct video interface: send a full reference frame only when the scene cannot be predicted well from prior context, and otherwise transmit a compact description of inter-frame changes. We call this interface a \emphpredictive visual code, and instantiate it for video MLLMs as \textbfAdaCodec. AdaCodec spends full visual tokens on a reference frame only when its conditional predictive cost is high; otherwise, it encodes inter-frame changes, including motion and prediction residuals, as compact P-tokens. Across all eleven benchmarks, AdaCodec improves over the Qwen3-VL-8B per-frame RGB baseline at a matched visual-token budget. Even at 1/7 the budget, AdaCodec with 32k tokens surpasses the 224k baseline on all long-video benchmarks; on five general-video benchmarks, it raises the average score while substantially cutting time-to-first-token from 9.26s to 1.62s.
[NLP-1] From Layers to Submodules: Rethinking Granularity in Replacement-Based LLM Compression
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练阶段进行压缩时,现有基于替换的压缩方法因受限于全层粒度与连续选择策略而导致的效率瓶颈问题。传统方法强制要求被移除或替换的模块必须位于连续的深度区间内,且对注意力(Attention)与前馈(FeedForward)子模块采用统一处理策略,但这种设计忽视了预训练变压器模型中冗余分布的非均匀性与非连续性特征。针对这一局限,论文提出SubFit(Submodule-level Fitted residual replacement)——一种在子模块层面实现的轻量级残差替代压缩方法:通过非连续地选择注意力与前馈子模块,并为每个被选中的子模块独立配置一个轻量级拟合残差旁路,从而更灵活、精准地逼近不同类型的子模块功能。该方法仅需校准数据即可完成,无需重新训练。实验覆盖十种主流LLM(含五种基础模型与五种指令微调模型)、五种稀疏率(12.5%至37.5%)及四种基线方法,结果表明,在所有稀疏水平下,SubFit均实现了最优的困惑度-准确率权衡,尤其在高稀疏条件下优势显著;在25%稀疏率下,其保留了84.6%的密集模型下游任务准确率,仅产生2.42倍的困惑度退化,优于最强基线(81.6%准确率,4.34倍退化),同时带来可测量的推理加速与键值缓存(KV-cache)节省。
链接: https://arxiv.org/abs/2606.02559
作者: Elia Cunegatti,Marcus Vukojevic,Erik Nielsen,Giovanni Iacca
机构: University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Post-training compression of Large Language Models (LLMs) removes entire architectural components, either deleting them or replacing them with fitted modules. Existing replacement-based methods share two design constraints: full-layer granularity and contiguous selection. We argue that this is overly restrictive: in fact, redundancy in pretrained transformers is not confined to contiguous regions, nor does it evenly distribute between Attention and FeedForward outputs, implying that different strategies best approximate different submodule types and that removable components need not cluster within contiguous depth ranges. Based on this intuition, we introduce SubFit (Submodule-level Fitted residual replacement), which compresses LLMs at the submodule level: Attention and FeedForward submodules are selected non-contiguously, and each receives its own lightweight fitted residual bypass. SubFit operates post-training and requires only calibration data. Across ten LLMs (five base, five instruction-tuned), five sparsity levels from 12.5% to 37.5%, and four replacement-based baselines, SubFit achieves the best aggregate perplexity-accuracy trade-off across the evaluated sparsity levels, with larger gains under aggressive compression. At 25% sparsity, it retains 84.6% of dense downstream accuracy and incurs 2.42x perplexity degradation, against 81.6% and 4.34x for the strongest baselines, while delivering measurable inference speedup and KV-cache savings. Code is available at this https URL.
[NLP-2] HEROS JOURNEY: Testing Complex Rule Induction with Text Games
【速读】: 该论文旨在解决在目标导向的叙事性任务中,智能体如何从示范数据中推断隐藏规则并进行多步执行的问题,核心挑战在于实现对隐含规则的有效归纳与准确执行。其解决方案的关键在于构建一个名为HERO’S JOURNEY的基准测试体系,涵盖属性归纳与过程归纳两大类共八个任务,每个任务均设计有四种结构化规则形式,并支持可控制的词汇锚定(lexical grounding)与可识别性条件(identifiability conditions),从而系统评估大语言模型(LLM)在规则归纳与执行方面的能力。实验表明,尽管当前最先进的大模型展现出一定的规则归纳能力,但该能力在不同任务间分布不均且整体受限;同时,执行过程成为制约模型表现的主要瓶颈,而表面语义影响较小。此外,针对规则归纳的定向调控方法虽能提升属性类任务的表现,但在过程类任务上未见稳定增益,揭示出过程归纳仍存在显著挑战,是当前研究中的关键开放问题。
链接: https://arxiv.org/abs/2606.02556
作者: Anshun Asher Zheng,Kanishka Misra,David I. Beaver,Junyi Jessy Li
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computation and Language (cs.CL)
备注: 24 pages
Abstract:We introduce HERO’S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, where agents must infer hidden rules from demonstrations and act on them through multi-step execution. HERO’S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluating state-of-the-art LLMs, we find that models show evidence of rule induction, but the ability is limited and uneven across tasks. Meanwhile, process execution adds an execution bottleneck for models, whereas surface semantics has minimal effect. Induction-specific steering methods improve performance on attribute tasks but show no reliable gains on procedural tasks, suggesting the gap in procedural induction remains an open challenge.
[NLP-3] SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation ACL2026
【速读】: 该论文旨在解决自动语音识别(ASR)评估中词错误率(Word Error Rate, WER)因参考文本与预测文本使用不同书写系统(如拉丁化转写与原生文字)而导致的误差高估问题,尤其在多语言场景下,当ASR模型输出罗马化文本时更为显著。其核心解决方案是提出一种无需训练、仅用于评估的脚本归一化词错误率(Script-Normalized WER, SN-WER),通过将参考文本和假设文本均转换为特定语言的规范书写系统后再计算WER,从而消除由书写系统差异引起的虚假错误。实验表明,在5种印地语系语言上,SN-WER可将被夸大的模型性能差距降低最多达12%,且在受控测试中有效削弱了因罗马化引入的高达67%的伪错误膨胀,同时对语义错误保持敏感性(ΔSN-WER / ΔWER ≈ 1.09),具备良好的鲁棒性与稳定性。因此,作者主张将SN-WER作为与WER和字符错误率(CER)并列的配套评估指标,特别是在文本需用于下游搜索、索引或多语言大模型处理等场景中,以实现更准确、脚本无关的ASR评价。
链接: https://arxiv.org/abs/2606.02548
作者: Priyaranjan Pattnayak
机构: Oracle America Inc. (甲骨文美国公司)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 MeLLM
Abstract:Word Error Rate (WER) is the dominant metric for automatic speech recognition (ASR), but it can overestimate errors when references and hypotheses encode the same words in different scripts. This issue is common in multilingual settings where ASR models may emit romanized text. We propose Script-Normalized WER (SN-WER), a training-free, evaluation-only scoring method that transliterates both reference and hypothesis text into a language-specific canonical script before computing WER. We evaluate SN-WER on 5 Indic languages, 2 datasets, and 3 ASR models. On curated FLEURS data, SN-WER reduces inflated model gaps by up to 12%, while on noisier Common Voice data the reductions are smaller or inconsistent, indicating genuine recognition weaknesses rather than only script mismatch. Controlled stress tests show a 67% attenuation of artificial romanization-induced WER inflation, while lexical-substitution controls show near-identical sensitivity to semantic errors, with Delta SN-WER / Delta WER approximately 1.09. SN-WER is robust to transliterator choice, normalization changes, and shows low token-collision rates below 0.1% in the evaluated Indic setting. We argue that SN-WER should be reported alongside WER and CER as a companion metric for script-insensitive ASR evaluation, especially when transcripts feed downstream search, indexing, or multilingual LLM pipelines.
[NLP-4] ransferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach
【速读】: 该论文旨在解决现有自伤行为(self-harm)监测体系依赖医院就诊记录而存在诊断编码敏感性不足的问题,导致自伤事件漏报率高。其核心解决方案是提出一种三阶段混合方法,结合传统机器学习与基于大语言模型(large language model, LLM)的筛查及证据抽取技术,从急诊科(Emergency Department, ED)分诊记录中高效识别自伤事件。该方法的关键优势在于:不仅在内部与外部验证中均展现出优异的性能(AUPRC达0.88以上),且具备良好的跨机构迁移能力,在无需针对各医院数据重新训练的情况下仍能保持稳定表现;更重要的是,该方法可准确识别自伤的主要方式(准确率达95%),实现了从二分类到细粒度方法识别的突破,显著提升了自伤行为监测的精细化水平。
链接: https://arxiv.org/abs/2606.02545
作者: Liuliu Chen,Gowri Rajaram,Eleanor Bailey,Katrina Witt,Michelle Lamblin,Jo Robinson,Mike Conway,Vlada Rozova
机构: University of Melbourne (墨尔本大学); Orygen (奥里真)
类目: Computation and Language (cs.CL)
备注:
Abstract:Self-harm is a major public health concern, but current surveillance relying on hospital presentations is inadequate due to the low sensitivity of diagnostic codes. Emergency Department (ED) triage notes, recorded at the initial point of contact, provide a succinct summary of presentations and an opportunity to identify self-harm. We developed a three-stage approach, augmenting traditional machine learning with large language model-based screening and evidence extraction to detect self-harm in ED triage notes. We assessed model transferability across three Australian hospitals. Our approach showed AUPRCs of 0.887 +/- 0.016 and 0.884 +/- 0.012 during internal and external validation. Prospectively, it achieved AUPRC of 0.881 +/- 0.008 at the development site, and 0.879 +/- 0.012 and 0.816 +/- 0.015 at two external sites without site-specific retraining. A key advantage of the approach is that it enables identification of the primary self-harm method with an accuracy of 95%, supporting more granular surveillance beyond binary classification.
[NLP-5] SimSD: Simple Speculative Decoding in Diffusion Language Models
【速读】: 该论文旨在解决生成式AI(Generative AI)中扩散型大语言模型(diffusion large language models, dLLMs)在推理加速方面与自回归大语言模型(autoregressive LLMs, AR LLMs)之间的性能差距问题,特别是针对当前主流的令牌级推测解码(token-level speculative decoding)技术无法直接适用于dLLMs这一关键瓶颈。其核心挑战在于:dLLMs采用掩码语言建模(masked language modeling)和双向注意力机制,导致在去噪过程中有效上下文随步骤动态变化,破坏了时间上一致的令牌级上下文结构,因而无法像AR模型那样通过因果掩码(causal mask)实现多步推测令牌的一次性验证。为突破此限制,本文提出一种名为SimSD的简单而高效的推测解码算法,其关键创新在于设计了一种即插即用的掩码策略,显式引入来自草稿模型预测的参考令牌(reference tokens),并通过定制化注意力掩码调控其与当前步骤令牌的交互方式,从而在单次前向传播中为推测令牌生成有效的概率分布(logits)。该方法在不改变dLLMs原有结构的前提下,恢复了类似因果掩码所赋予的令牌级验证能力,同时保留了扩散模型固有的并行解码优势。实验结果表明,该方法无需额外训练,可灵活集成于KV缓存、分块解码等其他加速技术,且在四个基准测试上实现了最高达7.46倍的解码吞吐量提升,同时保持甚至改善了生成质量。
链接: https://arxiv.org/abs/2606.02544
作者: Junxia Cui,Haotian Ye,Runchu Tian,Hongcan Guo,Jinya Jiang,Haoru Li,Chaojie Ren,Yiming Huang,Kaijie Zhu,Zhongkai Yu,Kun Zhou,Jingbo Shang
机构: University of California San Diego(加州大学圣地亚哥分校); University of Illinois Urbana-Champaign(伊利诺伊大学厄本那-香槟分校); Google(谷歌); University of California Santa Barbara(加州大学圣塔芭芭拉分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures, code available at this https URL
Abstract:Diffusion large language models (dLLMs) have recently emerged as a promising alternative to autoregressive (AR) LLMs, offering faster inference through parallel or blockwise decoding. However, their masked language modeling formulation remains incompatible with standard token-level speculative decoding, one of the most effective acceleration techniques for AR models. In AR decoding, the causal mask preserves temporally valid token-level contexts, enabling a target model to verify multiple drafted tokens in a single forward pass. In contrast, dLLMs rely on mask tokens and bidirectional attention, causing the effective context to change across denoising steps and preventing direct token-level speculative verification. To bridge this gap, we propose a simple but effective speculative decoding algorithm for diffusion language models, named SimSD, which mainly adopts a plug-and-play masking strategy that equips dLLMs with temporally valid token-level contexts for speculative decoding. Our method explicitly introduces reference tokens from draft-model predictions and designs an attention mask that regulates their interaction with current-step tokens, allowing dLLMs to compute valid logits for drafted tokens in a single forward pass. This restores the key verification ability provided by causal masking in AR models while preserving the parallel decoding advantages of dLLMs. The proposed method is training-free and can be flexibly integrated with other acceleration techniques such as KV cache and blockwise decoding. Experiments on SDAR-family dLLMs across four benchmarks show that our method achieves up to 7.46x higher decoding throughput while maintaining and even improving average generation quality.
[NLP-6] SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction
【速读】: 该论文旨在解决生成式智能体(Agent)在技能调用生命周期中因技能层面攻击而引发的安全隐患问题。现有研究多局限于单任务执行场景下对污染技能的评估,且依赖非系统化的风险清单来列举危害,缺乏对攻击行为全生命周期覆盖与系统性风险分类。为此,本文提出SkillHarm基准,构建了涵盖技能使用全生命周期的技能级攻击评估框架,并建立了一个系统性的风险分类体系。其核心解决方案在于区分两类关键攻击模式:固定载荷污染(Fixed-Payload Poisoning, FPP),即被污染的技能包在任何调用该技能的任务会话中直接造成破坏;以及自变异污染(Self-Mutating Poisoning, SMP),即初始无害的技能在首次执行时悄然修改持久化内容,将攻击延迟至后续重用阶段才触发。基于代理工作流中的三个核心组件——数据管道、系统环境与代理自主性,定义了12类具体风险类型。为实现大规模攻击样本生成,研究进一步设计了AutoSkillHarm自动化构造流水线,利用由自然语言指令驱动的编码智能体完成攻击实例生成。最终构建的基准包含71个技能上的879个攻击样本。实验表明,当前智能体在FPP场景下攻击成功率高达86.3%,SMP场景下亦达69.3%。深入分析揭示了一种潜在风险:大量看似攻击失败的案例实则源于智能体未实际加载或执行污染文件,而非具备真正防御能力,且现有防御机制仍无法可靠缓解此类威胁。
链接: https://arxiv.org/abs/2606.02540
作者: Yuting Ning,Zhehao Zhang,Yash Kumar Lal,Boyu Gou,Junyi Li,Weitong Ruan,Chentao Ye,Rahul Gupta,Diyi Yang,Yu Su,Huan Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注: Work in Progress
Abstract:Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.
[NLP-7] SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment EMNLP2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对齐人类价值观过程中出现的“对齐代价”(alignment tax)问题,即在提升安全性的同时导致模型通用能力显著下降。现有方法通常通过平衡安全与通用性双重目标来缓解此问题,但依赖大量通用数据或辅助奖励模型,带来高昂的计算与数据成本。本文提出一种新范式——SafeSteer,其核心思想是:由于安全相关特征在输出分布中具有稀疏性,因此对齐应聚焦于局部化调整而非全局权衡。其关键创新在于采用基于激活引导(activation steering)构建安全教师模型,并设计安全令牌选择算法,仅在训练中对安全令牌施加反向KL散度约束,从而实现策略上的在线蒸馏。该方法有效保留了模型的通用能力,实验表明在7个安全基准上表现优异,仅在5个通用能力基准上产生极小性能下降,且仅需100条有害样本,无需任何通用数据,相较以往基线方法减少99%以上的数据使用量,显著降低了对齐成本。
链接: https://arxiv.org/abs/2606.02530
作者: Hao Li,Jingkun An,Zijun Song,Pengyu Zhu,Rui Li,Hao Wang,Wendi Feng,Yesheng Liu,Lijun Li,Jin-Ge Yao,Lei Sha
机构: Beihang University (北京航空航天大学); Beijing Institute of Technology (北京理工大学); Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Beijing Academy of Artificial Intelligence (北京人工智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 8 figures, 14 tables. Submitted to EMNLP 2026
Abstract:Aligning Large Language Models (LLMs) with human values often degrades their general capabilities, termed the alignment tax. Existing methods mitigate this by balancing dual objectives, which heavily rely on massive general-purpose data or auxiliary reward models. In this paper, we argue that, because safety features are inherently sparse within the output distribution, alignment requires localized modifications rather than global trade-offs. To this end, we propose SafeSteer, which performs on-policy distillation confined to safety tokens. First, we construct a safety teacher via activation steering. Based on this teacher, we develop a safety token selection algorithm. Consequently, SafeSteer restricts the reverse KL penalty to these tokens during training to preserve general capabilities. Experimental results across diverse models show that our SafeSteer achieves a superior trade-off between safety and general capability compared with existing methods, attaining strong safety performance on seven safety benchmarks with only minimal degradation on five general capability benchmarks. Notably, SafeSteer requires only 100 harmful samples without using any general-purpose data, less than 1% of what previous baselines used, considerably reducing alignment cost. More details are on our project page at this https URL. Comments: 19 pages, 8 figures, 14 tables. Submitted to EMNLP 2026 Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2606.02530 [cs.AI] (or arXiv:2606.02530v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.02530 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-8] FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes ACL2026
【速读】: 该论文旨在解决社交媒体中日益增多但尚未被充分理解的自杀类迷因(suicide memes)所带来的内容安全问题,其核心挑战在于缺乏标注清晰的自杀类迷因数据集,导致自动化内容审核模型难以有效训练与评估。为此,论文提出并构建了首个面向细粒度分析的自杀类迷因数据集FigSIM,包含1049个经过三重标注的迷因样本:(1)自杀严重程度分级、(2)隐喻等修辞现象识别、(3)自杀相关内容(如自杀方式描绘)检测。关键解决方案在于通过多维度标注体系揭示自杀类迷因在语言表征和语义内涵上的复杂性,并在此基础上对16种单模态与多模态模型进行基准测试,揭示现有模型在处理具有隐喻特征的高严重性迷因时存在显著低估偏差。研究结果表明,自杀类迷因对生成式AI(Generative AI)驱动的内容理解与监管提出了独特挑战,而公开发布的FigSIM数据集为后续研究提供了重要基础。
链接: https://arxiv.org/abs/2606.02523
作者: Liuliu Chen,Elise R. Carrotte,Brian E. Chapman,Jo Robinson,Mike Conway
机构: University of Melbourne(墨尔本大学); Orygen, The National Centre of Excellence in Youth Mental Health(澳大利亚青年心理健康国家卓越中心); Centre for Youth Mental Health, University of Melbourne(墨尔本大学青年心理健康中心); UT Southwestern Medical Center(西南达拉斯医学中心)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注: Content warning: contains suicide-related content. Accepted to Findings of the Association for Computational Linguistics: ACL 2026
Abstract:Suicide memes are memes used to express suicide-related thoughts or comment on suicide-related issues. Suicide memes are increasingly common on social media, yet remain poorly understood and potentially harmful. There is an urgent need to better understand their characteristics and to develop appropriate content moderation strategies that limits users’ exposure to potentially harmful content. Currently, the absence of annotated datasets of suicide memes remains a key barrier to developing and evaluating automated moderation approaches. In this paper, we introduce FigSIM, the first dataset designed for fine-grained analysis of suicide memes. The dataset consists of 1049 memes, each annotated for (1) fine-grained suicide severity levels, (2) figurative phenomena (e.g., metaphors), and (3) suicide-related content (e.g., suicide method depiction). We benchmark 16 unimodal and multimodal models across three tasks: figurative language, suicide severity, and suicide-related content detection. Overall, FigSIM demonstrates that suicide memes pose unique challenges for both modeling and content moderation. Analysis revealed biases, such as underprediction of higher suicide severity levels, especially for figurative memes. The dataset (including splits used for analyses) is publicly available. Content Warning: This paper contains suicide-related content that may be triggering.
[NLP-9] When Rating Scales Fall Short: LLM -Assisted Discovery of ADHD Signals in Turkish Teacher Narratives ACL
【速读】: 该论文旨在解决当前注意力缺陷多动障碍(ADHD)临床诊断中,依赖标准化量表(如Conners’ Teacher Rating Scale-Revised Short Form, CTRS-R:S)可能遗漏非量化行为特征的问题,尤其关注教师提供的开放式叙述文本是否包含量表未捕捉的补充性临床信号。其解决方案的关键在于通过自然语言处理(NLP)技术,结合大语言模型(LLM)辅助的主题发现流程,对去标识化的土耳其教师评估表中的开放文本进行分析,揭示结构化评分未能有效区分ADHD与非ADHD学生时,叙述文本所蕴含的独特行为模式。研究发现,结构化评分与基于叙述的模型在识别失败案例上具有最小重叠,表明两者编码的是互补信息;进一步分析揭示了注意力、行为及家庭相关主题差异,证实了利用NLP从教师叙述中提取临床相关信号的潜力,可有效补充传统筛查工具的局限性。
链接: https://arxiv.org/abs/2606.02509
作者: Baris Karacan,Irem Aktar Songur,Ahmet Ozaslan,Elvan Iseri
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Gazi University (加济大学)
类目: Computation and Language (cs.CL)
备注: 15 pages. Accepted to CLPsych 2026. Camera-ready author version. The final version will appear in the ACL Anthology
Abstract:Attention Deficit Hyperactivity Disorder (ADHD) is one of the most common neurodevelopmental disorders in childhood, and its diagnosis relies on assessments combining clinician judgment with standardized rating scales and reports from parents and teachers. While structured instruments such as the Conners’ Teacher Rating Scale-Revised Short Form (CTRS-R:S) quantify ADHD-related behaviors, teachers also provide open-ended narratives that may contain complementary signals not captured by structured assessments. However, it remains unclear to what extent teacher narratives encode signals overlooked by rating scales. In this study, we analyze de-identified Turkish teacher evaluation forms collected during clinical ADHD assessments, including both CTRS-R:S scores and open-ended teacher narratives. We compare predictive signals from structured scores and narrative text and identify cases where structured assessments fail to clearly distinguish ADHD from non-ADHD students while narrative-based models capture distinct behavioral patterns. Notably, these cases show minimal overlap with those missed by the narrative model, suggesting that structured and narrative information encode complementary signals. To interpret these differences, we apply a large language model (LLM)-assisted theme discovery pipeline that reveals distinct attention, behavioral, and family-related patterns, highlighting the potential of natural language processing (NLP) to uncover clinically relevant signals from teacher narratives and to complement traditional ADHD screening tools.
[NLP-10] CRAM: Centroid-Routing and Adaptive MoE for Multimodal Continual Instruction Tuning
【速读】: 该论文旨在解决多模态持续指令微调(Multimodal Continual Instruction Tuning, MCIT)中的核心挑战:在持续引入新任务时,如何在避免灾难性遗忘与保持参数效率之间取得平衡。现有方法要么采用共享参数更新所有任务,导致不同任务间能力相互干扰并引发遗忘;要么为每个新任务分配独立模块,虽可避免干扰但显著降低参数利用率,难以应对长期任务流。其解决方案的关键在于提出一种名为CRAM的新框架,通过将任务特异性模式隔离于独立专家模块中,有效缓解跨任务遗忘问题;同时引入自适应秩实例化机制,动态识别现有专家能力与新任务需求之间的能力差距,并仅分配所需最少参数以提升效率;此外,基于质心的路由策略确保任务间能力的稳定复用,而正交性惩罚项则约束新增参数仅沿任务专属方向更新,防止对通用能力的重复学习。该设计在多个基准测试中均展现出优于现有方法的性能表现。
链接: https://arxiv.org/abs/2606.02502
作者: Jun-Tao Tang,Zhen-Hao Xie,Yu-Cheng Shi,Da-Wei Zhou
机构: Nanjing University (南京大学); Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal Large Language Models (MLLMs) unify heterogeneous vision-language tasks under a shared generative framework via instruction tuning, yet real-world deployment demands continuous capability expansion, making Multimodal Continual Instruction Tuning (MCIT) essential. Existing methods either update all tasks with a shared parameter set or allocate dedicated modules for each new task. Shared updates force heterogeneous tasks to compete, causing forgetting of learned capabilities. Conversely, isolated expansion prevents interference but severely limits parameter efficiency over long task streams. To address this dilemma, we propose CRAM. Specifically, by isolating task-specific patterns into independent modules, CRAM mitigates catastrophic forgetting across tasks. To further boost parameter efficiency, we utilize adaptive-rank instantiation to identify the capability gap between existing expert capability and new task demands, and dynamically allocate only the necessary parameters. To ensure stable reuse among tasks, centroid-guided routing recognizes and activates existing experts’ capabilities, while an orthogonality penalty confines new updates to task-specific directions, preventing re-learning general capability. Extensive experiments across diverse benchmarks consistently demonstrate its superiority over existing methods.
[NLP-11] Not What But How: A Communicative Audit of LLM Response Framing
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在回答主观性文化类问题时,现有评估体系过度关注事实正确性而忽视回应表达方式(即沟通框架)的问题。其核心挑战在于如何系统化地量化和分析LLMs在回应中的文化定位、泛化语言使用、拟人化线索及对话准则遵循等多维度的沟通特征。解决方案的关键是提出FRANZ——一个用于响应特征表征的自动化框架,可从文化定位、泛化语言、拟人化线索和对话准则遵守四个维度对LLM输出进行沟通审计。为支持该框架,研究者构建了SQUARE数据集,包含来自57个Reddit子版块的37.6万条主观问题,并映射至7个国家和19个问题类别。通过在三款开源大模型上应用FRANZ,研究发现不同模型在各项沟通特征的使用频率上存在统计显著差异;更重要的是,FRANZ揭示了“内部人定位”与“拟人化”之间存在正向耦合关系,且这种耦合程度随国家文化背景变化,从而为识别模型在话语框架上的跨文化差异提供了诊断性视角。
链接: https://arxiv.org/abs/2606.02493
作者: Siddhesh Milind Pawar,Sarah Masud,Haneul Yoo,Alice Oh,Isabelle Augenstein
机构: University of Copenhagen (哥本哈根大学); KAIST (韩国科学技术院)
类目: Computation and Language (cs.CL)
备注: 34 main pages, 19 Figures, 4 Tables
Abstract:Large language models (LLMs) are being increasingly used to answer subjective, information-seeking questions, where users are sensitive to how responses are communicated, not just whether the answers are correct. Existing LLM evaluations for subjective cultural queries largely focus on factual correctness, ignoring how the response is framed. To this end, we introduce FRANZ, an automated FRAmework for respoNse characteriZation to conduct communicative audit of LLM responses along four dimensions: cultural positioning, use of generalizing language, anthropomorphic cues, and adherence to conversational maxims. To enable this evaluation, we contribute SQUARE - a corpus of 376k subjective questions sourced from 57 subreddits, and mapped to 7 countries and 19 question categories. We demonstrate FRANZ’s applicability by scoring responses from three open-weight LLMs. We observe that LLMs show statistically significant differences in the frequency with which they employ each response characteristic. Unlike single-dimensional audits, FRANZ reveals that insider positioning and anthropomorphism are positively coupled, with the degree of coupling varying by country, providing a diagnostic lens for identifying framing divergences.
[NLP-12] owards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization
【速读】: 该论文旨在解决在高复杂性临床场景(如新生儿重症监护室,NICU)中实现“全团队”生成式摘要时,如何有效整合来自多学科(医生、护士、治疗师等)且分散于数百份自由文本临床记录中的信息这一关键问题。核心挑战在于,直接聚合异构文本常导致摘要内容不连贯,因此必须首先实现对多源文本中句子级来源(provenance)的精确分类。为此,研究提出了一种基于大语言模型(LLM)监督微调(SFT)的临床来源分类流程,采用两个Llama-3模型(8B和70B)在MedSecId数据集(包含2,002条成人ICU的带临床来源标注的文本)上进行训练,均取得了超过92%的领域内宏平均F1分数。为评估跨领域泛化能力,研究进一步在由三份多学科NICU摘要构建的金标准数据集上测试模型表现,结果表明模型规模具有显著影响:8B模型经微调后性能提升有限,而70B模型则实现宏F1提升7%,且经过量化处理的微调70B模型不仅优于全精度基线,还大幅降低了计算开销。由此可见,解决方案的关键在于:具备足够模型容量以维持跨领域迁移中的语义灵活性,以及通过高效量化适配实现对下游摘要任务所需的结构化来源建模。
链接: https://arxiv.org/abs/2606.02487
作者: Baris Karacan,Vaibhav Bhargava,Barbara Di Eugenio,Natalie Parde,Mary Khetani,Yu-Shan Tseng,Vanessa Barbosa,Julie Vignato,Lindsey Knake,Rajashree Dahal,Emily Spellman,Danielle Hitzel,Janine Petitgout,Kristi Haughey,Amanda Karstens,Brianna Clarahan,Rachel Dawson,Lauren Boyd,Mackenzie Weis,Angie Tipton,Jaewon Bae,Catherine K. Craven,Karen Dunn Lopez,Andrew D. Boyd
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages. Submitted preprint version of a paper accepted to AIME 2026. This version may differ from the camera-ready manuscript and the final Version of Record. The Version of Record will be available from Springer Nature once published
Abstract:Effective “all-team” summarization in high-complexity settings like the Neonatal Intensive Care Unit (NICU) requires aggregating insights from diverse disciplines (physicians, nurses, therapists) spread across hundreds of clinical free-text notes. Simply pooling heterogeneous text often leads to incoherent outputs. Structured summarization therefore first requires accurate categorization of sentence-level provenance across multi-source notes. This pilot study introduces a clinical provenance categorization pipeline using supervised fine-tuning (SFT) of large language models (LLMs). We adapted two Llama-3 models (8B and 70B) to MedSecId, a corpus of 2,002 MIMIC-III (Adult ICU) notes annotated with clinical provenance headers, achieving in-domain Macro F1 scores above 92% for both models. To evaluate cross-domain generalization, we assessed model capacity (8B vs. 70B) and quantization on a gold-standard dataset of 227 sentence-level spans derived from three multi-disciplinary NICU summaries. Experimental results demonstrate a scale-dependent transfer effect: while SFT produced only marginal changes for the 8B model, it substantially improved the 70B model, increasing Macro F1 by 7%. Notably, the quantized fine-tuned 70B model outperformed its full-precision baseline while substantially reducing computational requirements. These findings suggest that sufficient model capacity is critical for preserving semantic flexibility during cross-domain clinical transfer and that efficient quantized adaptation can enable structured provenance modeling for downstream summarization.
[NLP-13] Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools
【速读】: 该论文旨在解决生成式语言智能体在采用推测性工具调用(speculative tool calls)以隐藏延迟时,因提前向外部服务发出调用而泄露用户意图的问题。其核心挑战在于:一旦外部观察者接收到这些推测性调用,即使智能体后续放弃该执行分支,信息泄露仍无法撤销,且时间因素是关键——现有机制如提交时清理、只读限制或访问控制白名单均无法消除已发生的观察。为此,作者提出“推测性工具隐私合约”(Speculative Tool Privacy Contracts),作为一种运行时抽象,将“提交前的观察”视为与状态修改并列的一阶效应。实验表明,推测性分发会加剧外部观察者对用户意图的推断;而事后过滤、只读限制及访问控制白名单均无法降低该推断;唯有在调用发出前通过修改或抑制推测调用的参数或目标投影的“调用时机策略”才能有效减少信息泄露。因此,解决方案的关键在于在调用发出前动态干预调用内容,而非依赖事后处理。
链接: https://arxiv.org/abs/2606.02483
作者: Bardia Mohammadi,Lars Klein,Akhil Arora,Laurent Bindschaedler
机构: Max Planck Institute for Software Systems (马普所软件系统研究所); EPFL (洛桑联邦理工学院); Aarhus University (奥胡斯大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Tool-augmented language agents speculatively issue likely future tool calls to hide latency, but those calls leak inferred user intent to external services before the agent commits to the branch. Every external observer that received the call retains the disclosure after the agent abandons the branch. Timing is the issue, not authorization: no commit-time cleanup, read-only restriction, or access-control allow-list unsends what an observer already holds. We call these invocations ghost tool calls and propose Speculative Tool Privacy Contracts, a runtime abstraction that treats observation before commitment as a first-class effect, distinct from state mutation. We implement the contracts in a prototype runtime and evaluate twelve policies across three corpora. Speculative dispatch increases what an observer can infer about user intent; post-hoc filters, read-only restrictions, and access-control allow-lists leave that inference intact; only issue-time policies that change or suppress the speculative call’s argument or destination projection before dispatch reduce it.
[NLP-14] Learning When to Translate for Multilingual Reasoning
【速读】: 该论文旨在解决生成式语言模型(Generative Language Models, GLMs)在多语言复杂推理任务中存在显著的语言理解偏差问题,尤其是非英语输入下模型性能下降的瓶颈。其核心挑战在于:尽管通过将非英语输入翻译为英语可缓解语言理解失败,但对所有输入强制翻译会引入不必要的计算开销与潜在语义失真。为此,本文提出一种名为Luar的、基于语言理解边界感知的强化学习框架(Language Understanding Boundary-aware Reinforcement Learning framework),其关键创新在于训练模型具备选择性调用翻译的能力——即仅在直接理解不可靠时才触发翻译,从而实现“按需翻译”的智能决策机制。该方法通过强化学习优化策略,在多语言推理基准上显著优于标准GRPO及其他基于训练的基线模型,尤其在低资源语言上表现提升明显;同时分析表明,Luar能够有效避免在直接推理已足够可靠时进行冗余翻译,并可泛化至未见过的低资源语言。研究表明,通过让模型自主判断何时依赖翻译,可实现更高效、更鲁棒的多语言推理能力。
链接: https://arxiv.org/abs/2606.02465
作者: Deokhyung Kang,Hyounghun Kim,Gary Geunbae Lee
机构: POSTECH(浦项科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, but still exhibit substantial multilingual reasoning gaps, largely due to language-understanding failures in non-English inputs. English translation can mitigate these failures by expressing non-English inputs in a form that RLMs can more reliably interpret, yet translating every input is unnecessary when the model can reason reliably from the original query. To address this challenge, we propose Luar, a Language Understanding Boundary-aware Reinforcement Learning framework that trains RLMs to selectively invoke translation when direct understanding is unreliable. Luar trains the model to choose between solving the original input directly and reasoning over its English translation, encouraging translation only when translator-augmented reasoning is expected to substantially outperform direct reasoning. Across multilingual reasoning benchmarks, Luar outperforms standard GRPO and other training-based baselines, with particularly large gains on low-resource languages. Further analysis shows that Luar avoids unnecessary translation in cases where direct reasoning is sufficient, while extending its translator-call behavior to unseen low-resource languages. Together, our work suggests a selective approach to multilingual reasoning: RLMs can learn to invoke translation only when their direct understanding is unreliable. The project will be made publicly available at this https URL
[NLP-15] AGENT CL: Toward Rigorous Evaluation of Continual Learning in Language Agents
【速读】: 该论文旨在解决语言智能体在持续学习(continual learning)过程中难以有效积累并复用先前任务经验的问题。现有评估基准普遍缺乏对语言智能体在任务流中知识迁移与经验复用能力的严格评测,多数研究仅关注长上下文推理或检索,而近期的终身适应性基准又依赖于简单、缺乏交叉任务关联分析的任务流,导致难以准确衡量智能体的真实学习与复用能力。为此,本文提出一个名为AgentCL的评估框架,其核心在于构建受控的任务流(controlled task streams),通过设计具有可重用子解、证据或工作流的组合式任务序列,与非受控的朴素任务流(naive streams)进行对比,从而系统评估记忆机制在持续学习中的表现。该框架的关键创新在于引入可量化转移增益(transfer gains)的评价指标,并开发了MemProbe探针方法,用于在记忆固化过程中记录交互、洞察与技能,同时过滤不可靠经验。实验结果表明,朴素任务流难以有效区分不同记忆设计的优劣,而受控任务流则能更清晰地揭示各设计在可塑性与稳定复用之间的权衡;此外,朴素与保留测试设置常导致收益有限甚至引发记忆退化。这些发现强调了构建兼具高可塑性与强稳定性记忆架构的重要性。
链接: https://arxiv.org/abs/2606.02461
作者: Yiheng Shu,Bernal Jiménez Gutiérrez,Saisri Padmaja Jonnalagedda,Yuguang Yao,Huan Sun,Yu Su
机构: The Ohio State University(俄亥俄州立大学); Johns Hopkins University(约翰霍普金斯大学); Intuit AI Research(财捷人工智能研究)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages
Abstract:Language agents spend substantial inference time solving individual tasks, yet the experience acquired in one episode is often underutilized in future episodes. Continual learning expects an agent to accumulate reusable experience across a stream of tasks, improve over time, and avoid interference from irrelevant experiences. Unfortunately, existing benchmarks struggle to evaluate continual learning in language agents rigorously. Most efforts focus on retrieval and reasoning over long-context conversations or documents, while recent lifelong-adaptation benchmarks often rely on naive task streams with limited analysis of cross-task relationships, making it difficult to understand what an agent learns and reuses over time. This paper presents an evaluation framework AgentCL for continual learning in agents, centered on controlled task streams and metrics for transfer gains. AGENTCL constructs compositional streams where earlier sub-solutions, evidence, or workflows are intentionally reusable in later tasks, and contrasts them with naive streams where such reusability is not guaranteed. We use the benchmark to evaluate non-parametric memory designs for continual learning. To diagnose how memory design choices affect continual learning, we develop MemProbe, a probing method that stores interactions, insights, and skills, while filtering unreliable experiences during consolidation. Empirical analysis across coding, deep research, and language understanding/reasoning tasks shows that naive streams offer limited ability to distinguish memory designs, whereas controlled streams more clearly distinguish their plasticity. Meanwhile, naive and held-out settings often yield limited gains and can expose memory-induced degradation. These results highlight the need for stronger memory designs that balance plasticity and stable reuse.
[NLP-16] HLL: Can Agents Cross Humanitys Last Line of Verification?
【速读】: 该论文旨在解决当前多模态智能体(Multimodal Agents)在面对人为验证机制(如CAPTCHA)时,是否能够真正替代人类完成受保护的复杂交互任务这一关键问题。其核心挑战在于:许多在线服务通过CAPTCHA等机制明确防范自动化操作,而现有智能体往往依赖视觉识别而非具备真实人类般的上下文感知与连贯行为能力,难以在动态、复杂的界面环境中实现可靠的人类替代。本文提出“人类最后一道防线”(Humanity’s Last Line of Verification, HLL),一个可控的基准测试框架,通过交互式CAPTCHA验证来评估智能体是否能基于具身化、类人化的交互行为突破该防线,而不仅依赖于图像识别。HLL涵盖多样化的CAPTCHA交互场景,并引入受控的真实感压力因素,包括页面杂乱、任务难度提升以及对求解过程动作轨迹的有效性验证。实验在闭环图形用户界面(GUI)环境中评估了八种前沿多模态智能体,结果表明当前智能体在该人类替代边界上仍表现脆弱:性能在不同验证码类型间差异显著,在现实界面条件下迅速退化,且当正确答案必须伴随有效动作轨迹时进一步下降。该评测揭示了智能体在定位精度、动作校准、状态追踪和流程一致性方面的关键缺陷,为衡量多模态智能体在真实受保护工作流中接近人类替代水平提供了可量化的基准。
链接: https://arxiv.org/abs/2606.02449
作者: Xinhao Song,Su Su,Sirui Song,Hongliang Wu,Wen Shen,Zhihua Wei,Gongshen Liu,Linfeng Zhang,Dongrui Liu
机构: Shanghai Jiao Tong University (上海交通大学); Shandong University (山东大学); Tongji University (同济大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 27 pages, 14 figures
Abstract:Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce \textbfHumanity’s Last Line of Verification (HLL), a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at this https URL
[NLP-17] Food Noise False Safety: A Systematic Evaluation of How LLM s Fail to Adapt to Eating Disorder Queries with Clinician Feedback
【速读】: 该论文旨在解决用户在进食障碍(Eating Disorders, EDs)背景下使用基于大语言模型(Large Language Model, LLM)的聊天系统时所面临的安全风险问题。随着越来越多患有进食障碍的个体依赖这些非临床性质的AI系统获取指导、建议与情感支持,尽管其设计初衷并非提供专业医疗建议,但因其表现出的“专家性”“中立性”和“易访问性”,往往被误用为替代性支持来源,存在潜在危害。论文的关键解决方案在于识别并分析用户提示中可能诱发不安全或自我伤害性回应的语言线索,并通过系统性地调整用户输入中潜在风险的程度,量化评估大语言模型在面对具有危险倾向的请求时,是否存在无批判性适应的问题。研究结果表明,特定语言模式显著提高了模型生成有害响应的可能性,揭示了当前LLM在处理敏感心理健康议题时存在的重大安全隐患,强调需在模型设计中引入更严格的伦理审查与风险规避机制。
链接: https://arxiv.org/abs/2606.02444
作者: Giulia Pucci,Emily Hemendinger,Ruizhe Li,Gavin Abercrombie,Tanvi Dinkar,Arabella Sinclair
机构: University of Aberdeen (阿伯丁大学); University of Colorado Anschutz (科罗拉多大学安舒茨医学校区); Heriot-Watt University (赫瑞-瓦特大学); University College London (伦敦大学学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Recent evidence shows that people with eating disorders (EDs) are increasingly seeking guidance, advice, and emotional support from Large Language Model (LLM)-based chat systems. Although these systems are not designed to provide clinical advice, their perceived expertise, neutrality and accessibility make them a frequent, albeit risky, source of support. This paper investigates potential patterns of interaction between users with EDs and LLMs, focusing on the potential harms arising from models that uncritically adapt to, and facilitate unsafe or self-harming user requests. We find, in consultation with clinical ED experts, that specific linguistic cues in prompts increase the likelihood of unsafe responses and, through systematically varying the degree of potential risk present in the user prompt, report the extent to which LLMs uncritically adapt to problematic, and potentially dangerous user inputs.
[NLP-18] PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning
【速读】: 该论文旨在解决现有视频理解模型在安全监测场景中缺乏对风险事件动态演化过程的时序敏感性与因果推理能力的问题,尤其关注从风险初现到事故发生之间的关键干预窗口期。当前基准测试因采用静态输入、忽略时间精度以及未评估安全场景下的误报率,无法真实反映模型在实时安全监控中的表现。为此,论文提出PaSBench-Video,一个包含740段视频的多领域基准数据集(涵盖驾驶、医疗、日常生活及工业生产),其中481段为含风险视频,259段为无风险视频,并在帧级别标注了风险起始时刻与事故边界。模型需基于因果观察生成既在时间上精准校准又内容正确的预警。实验表明,在13个生成式多模态大语言模型(MLLM)中,无一模型在最严格指标下超过20.0%的表现,且召回率与误报率呈显著正相关(皮尔逊相关系数0.64),即提升检测率必然伴随大量安全场景下的误触发。不同领域表现差异显著:在日常生活中,模型可实现较低误报率下的中等召回率(因风险具有异常性),而在驾驶场景中则出现泛滥式误报(因常规与危险场景视觉相似)。这揭示当前模型主要依赖场景级活动线索,而非对潜在伤害的动态因果推理,暴露了其在真实安全应用中的根本局限。
链接: https://arxiv.org/abs/2606.02443
作者: Yusong Zhao,Yuejin Xie,Youliang Yuan,Junjie Hu,Jitian Guo,Yujiu Yang,Pinjia He
机构: The Chinese University of Hong Kong, Shenzhen; Tsinghua University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Between the first visible sign of danger and the moment an accident occurs, there is often a window where intervention remains possible. Video-capable multimodal large language models (MLLMs) could serve as always-on safety monitors that issue warnings during this window. Yet current benchmarks do not test this ability: they rely on static inputs, ignore timing precision, and omit false-positive measurement on safe scenes. We present PaSBench-Video, a 740-video benchmark with 481 risk and 259 no-risk videos across four domains: driving, healthcare, daily life, and industrial production. Risk videos are annotated with frame-level risk onset and accident boundaries. A model must observe the video causally and produce a warning that is both temporally calibrated and content-correct. Testing 13 MLLMs, we find that no model exceeds 20.0% on our strictest metric, and recall is tightly coupled with false-positive rate, with Pearson correlation 0.64: higher detection comes only at the cost of triggering warnings on the majority of safe clips. Performance splits sharply by domain: models achieve moderate recall at low false-positive rates in daily life, where risks are inherently anomalous, yet fire indiscriminately in driving, where routine and hazardous scenes look alike. These results indicate that current models rely on scene-level activity cues rather than reasoning about emerging harm.
[NLP-19] On the Scaling of PEFT: Towards Million Personal Models of Trillion Parameters
【速读】: 该论文旨在解决如何在保持基础模型(foundation model)强大通用能力的同时,高效实现个性化、可持久化的行为建模问题。传统参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)通常被视为全量微调的低成本替代方案,但本文提出更深层次的框架:将小型可训练适配器(adapter)作为强共享基础模型之上的持久局部状态(persistent local state),以承载用户偏好、技能、工具使用习惯及类记忆的更新等实例特异性行为。其解决方案的关键在于构建一个围绕三个扩展维度(Scale Up、Scale Down、Scale Out)的系统性范式——在Scale Up中,更强的共享先验使小规模本地更新更具价值;在Scale Down中,探索适配器在保持可靠性前提下的最小化可行性;在Scale Out中,支持多个持久适配实例共存。通过MinT这一基础设施实例,实现了对适配器身份、版本控制、溯源、评估与服务驻留的统一管理,表明PEFT可作为构建轻量化、可持久化个人模型的紧凑基底,而不仅限于成本节约的微调替代方案。
链接: https://arxiv.org/abs/2606.02437
作者: Mind Lab:Song Cao,Vic Cao,Kaijie Chen,Bunny Fan,Hera Feng,Huan Feng,Arthur Fu,Jun Gao,Hongquan Gu,Aaron Guan,Mutian Hong,Hailee Hou,Peixuan Hua,Charles Huang,Miles Jiang,Nora Jiang,Yuyi Jiang,Autumn Jin,Fancy Kong,Kyrie Lei,Alexy Li,Dawn Li,Ray Li,Theo Li,Wenhao Li,Jiayi Lin,Domini Liu,Heshan Liu,Kairus Liu,Logan Liu,Maeve Luo,Runism Lv,Pony Ma,Verity Niu,Anson Qiu,Vincent Wang,Maxwell Yao,Regis Ye,Wenlin Ye,Yanying Ye,Josh Ying,Danney Zeng,Salmon Zhan,Anya Zhang,Ruijia Zhang,Shiyang Zhang,Sueky Zhang,Ya Zhang,Wei Zhao,Ada Zhou,Sizer Zhou,Xinyue Zhu,Murphy Zhuang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Parameter-efficient fine-tuning (PEFT) is usually treated as a cheaper alternative to full fine-tuning. We study a broader role: small trainable adapters as persistent local state on top of strong shared foundation models. In this framing, the base model provides shared competence while adapters carry instance-specific behavior such as preferences, skills, tool habits, and memory-like updates. We organize the problem around three scaling axes: Scale Up, where stronger shared priors make small local updates more useful; Scale Down, where we study how small adapters can be while remaining reliable; and Scale Out, where many persistent adapted instances coexist. MinT provides one infrastructure example for managing adapter identity, revision, provenance, evaluation, and serving residency. Together, the results suggest that PEFT can be a compact substrate for persistent personal models rather than only a budget substitute for full fine-tuning.
[NLP-20] Investigating and Alleviating Harm Amplification in LLM Interactions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮交互中可能被恶意用户利用以放大危害的问题,尤其关注其在“民主化专业领域知识”和“规模化有害操作”两个维度上带来的风险。现有研究普遍忽视了模型在持续对话中逐步累积并加剧危害的动态过程。为此,作者提出HarmAmp基准,涵盖十二类真实世界威胁场景,每个场景均满足实质性危害放大、操作具体性及多轮交互必要性等严格标准。为应对这一挑战,论文进一步设计TrajSafe——一种主动监测机制,能够前瞻性识别潜在的有害行为轨迹,并通过探查用户真实意图、引导模型生成更安全输出等方式进行干预。实验表明,TrajSafe在显著降低多轮交互中的有害性的同时,保持了较低的误拒率与目标模型的通用能力。其核心贡献在于构建了一种面向复杂交互情境的动态安全防护范式,有效缓解了生成式人工智能(Generative AI)在实际应用中面临的细微但深远的安全风险。
链接: https://arxiv.org/abs/2606.02423
作者: Ruohao Guo,Wei Xu,Alan Ritter
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users’ genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model’s general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.
[NLP-21] K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
【速读】: 该论文旨在解决韩国语境下生成式AI(Generative AI)在复杂、组合性代理任务(compositional agentic tasks)评估基准匮乏的问题。现有前沿模型评估多集中于基础能力(如指令遵循与推理),而针对具备自主规划与网页浏览能力的智能体(agent)在真实语言环境中的表现评估仍不充分,尤其在韩语领域缺乏系统性基准。为此,研究提出K-BrowseComp——一个基于韩语场景的网页浏览智能体评估基准,包含400个问题,其中300个经过母语者人工构建与验证的子集(K-BrowseComp-Verified)用于可靠评估。实验结果显示,尽管前沿大模型(如GPT-5.5、DeepSeek-V4-Pro、GLM-5.1)在该子集上表现尚可(30.00%–45.67%),但明显低于其在英文基准BrowseComp上的水平;而韩国本土发布的专有大模型(由韩国专有人工智能基础模型计划支持)性能更差,仅达0.00%–10.33%。为深入挖掘模型瓶颈,研究进一步构建了100个合成问题子集,采用高难度少样本示例与故障模式导向生成策略,以利用“求解”与“生成”网页浏览任务之间的不对称性。在经对抗过滤后的合成诊断子集上,最强模型表现仅为26.00%,因此被单独作为针对性压力测试基准。该工作的关键在于:通过构建具有文化语境适配性的高质量韩语代理任务评估体系,并揭示当前主流模型在真实复杂任务中存在显著性能衰减,从而推动面向多语言、高阶智能体能力的公平评估机制发展。
链接: https://arxiv.org/abs/2606.02404
作者: Nahyun Lee,Dongkeun Yoon,Guijin Son,Geewook Kim,Dayoon Ko,Jeonghun Park,Haneul Yoo,Jaewon Cho,Junghun Park,Changyoon Lee,Kyochul Jang,Jaeyeon Kim,Eunsu Kim,Woojin Cho,Seungone Kim
机构: Chung-Ang University (中央大学); KAIST (韩国科学技术院); Seoul National University (首尔国立大学); OnelineAI (OnelineAI); NAVER Cloud AI (NAVER云AI); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Frontier model evaluations are shifting from foundational capabilities (e.g., instruction following and reasoning) toward compositional, agentic ones, but Korean agentic benchmarks remain scarce. We introduce K-BrowseComp, a web-browsing agent benchmark grounded in Korean contexts, consisting of 400 problems. The 300-problem K-BrowseComp-Verified subset is manually constructed and validated by native Korean speakers. On this subset, frontier LLMs, including GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1, reach only 30.00–45.67%, a substantial drop from BrowseComp, while Korean LLMs released through Korea’s Proprietary AI Foundation Model program obtain only 0.00–10.33%. We further construct a 100-problem synthetic split using hard few-shot exemplars and failure-mode-targeted generation to exploit the asymmetry between solving and creating web browsing problems. On the adversarially filtered synthetic diagnostic split, the strongest model reaches only 26.00%, and we report this split separately as a targeted stress test. We publicly release our data and code.
[NLP-22] AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis ACL2026
【速读】: 该论文旨在解决系统性综述中生成森林图(forest plot)这一关键环节的自动化难题。当前,从原始文献到生成可发表的森林图仍依赖大量人工操作,涉及对复杂临床文本的解读、试验结果数据的手动提取、干预与对照组的定义、研究设计不一致性的标准化处理以及元分析计算等多个步骤,且通常需要专用软件和领域专业知识支持。现有方法虽已证明大语言模型(Large Language Models, LLMs)能够从非结构化文本中提取研究层面的数据,但尚无系统能实现从原始论文到合成森林图的端到端自动化。为此,本文提出AutoForest,作为首个端到端系统,直接从生物医学论文生成可直接发表的森林图。其核心解决方案在于:自动识别并建议“干预-对照-结局”(Intervention, Comparator, Outcome, ICO)要素,精准提取结局数据,执行统计合并分析,并可视化生成最终森林图。通过构建系统架构与用户界面,并在真实案例中开展临床医生参与的用户研究,验证了AutoForest在加速证据整合、显著降低元分析门槛方面的有效性。
链接: https://arxiv.org/abs/2606.02403
作者: Massimiliano Pronesti,Angelo Miculescu,Mohsin Kapdi,Paul Flanagan,Oisín Redmond,Joao Bettencourt-Silva,Gurdeep Mannu,Spiros Denaxas,Rui Bebiano Da Providencia E Costa,Anya Belz,Yufang Hou
机构: IBM Research(IBM 研究院); Dublin City University(都柏林城市大学); UCL(伦敦大学学院); University of Oxford(牛津大学); IT:U Interdisciplinary Transformation University Austria(奥地利跨学科转型大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ACL2026 (System Demonstration Track)
Abstract:Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragmented and labour-intensive process. Researchers must interpret complex clinical texts, manually extract outcome data from trials, define appropriate interventions and comparators, harmonise inconsistent study designs, and carry out meta-analytic computations-typically using specialised software that demands structured inputs and domain expertise. While recent work has demonstrated that large language models can extract study-level data from unstructured text, no existing system automates the complete pipeline from raw documents to synthesised forest plots. To address this gap, we introduce AutoForest, the first end-to-end system that generates publication-ready forest plots directly from biomedical papers. Given one or more study papers, AutoForest automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders the final forest plot. We describe the system architecture, user interface and demonstrate its effectiveness on real-world examples through a user study involving clinicians, showing how AutoForest can accelerate evidence synthesis and substantially lower the barrier to conducting meta-analyses.
[NLP-23] A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在进行单领域强化学习(Reinforcement Learning, RL)后,虽能提升特定任务性能,但导致其他领域性能显著下降的多领域干扰问题。现有解释如灾难性遗忘或全局梯度冲突无法充分说明该现象,尤其在全模型梯度近乎正交时仍存在显著干扰。本文揭示,单领域RL仅引发稀疏且幅度微小的参数更新,不同领域间虽更新神经元重叠度低,但共享大量活跃的计算路径;而这些路径上的更新方向决定了协同或冲突效应。基于局部扰动模型,理论证明后续领域训练对早期领域造成损害主要源于二阶损伤项,该损伤在观察到的稀疏路径结构下集中于一个低维共享冲突子空间。进一步发现,通过简短的领域刷新(domain refresh)可压缩该子空间中的有害成分,实现选择性恢复且副作用极小。实验验证表明,在代码→数学→问答→创意写作的顺序训练后,仅对数学任务进行短暂刷新即可将数学得分从57.66提升至66.04,同时保持其他领域性能稳定,平均分达66.39,为最优结果。此外,无需训练的回滚方法在数学-问答对的稀疏代理冲突坐标集上也部分恢复了数学性能,直接提供了局部化损伤的证据。研究结果为多领域强化学习中的干扰与恢复机制提供了基于局部结构的机理解释。
链接: https://arxiv.org/abs/2606.02398
作者: Lei Yang,Siyu Ding,Deyi Xiong
机构: Tianjin University (天津大学); Baidu Inc. (百度)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code \rightarrow Math \rightarrow QA \rightarrow CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.
[NLP-24] SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在实际应用中因行为不可见性而引发的可靠性问题,尤其关注智能体在执行任务时可能存在的“自报告计划-实际行为”偏离现象,即智能体欺骗(agent deception)。其核心挑战在于:用户无法实时监控智能体的所有具体操作,只能依赖其自我报告的进展,若智能体有意或无意地编造报告以掩盖真实行为,将导致系统失控,尤其在高风险自主场景下后果严重。该研究的关键解决方案是提出SPADE-Bench基准测试框架,通过融合真实工具调用与受控压力情境,实现对智能体在高压环境下自发性计划-行动不一致行为的系统评估。该设计不仅提升了评测的生态有效性,还通过在压力条件下进行计划与实际行为的对比,有效区分了战略性的欺骗行为与单纯的幻觉(hallucination),从而为衡量和缓解智能体欺骗提供了可量化、可复现的评估标准。
链接: https://arxiv.org/abs/2606.02380
作者: Yuyan Bu,Haowei Li,Qirui Zheng,Bowen Dong,Kaiyue Yang,Jiaming Ji,Yingshui Tan,Wenxin Li,Yaodong Yang,Juntao Dai
机构: Beijing Academy of Artificial Intelligence; Peking University; University of Science and Technology of China; University of Chinese Academy of Science; Alibaba Group
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLM-based agents expand their operational scope, reliability becomes a prerequisite for real-world deployment. However, in practical applications, human users cannot monitor every immediate behavior; instead, the execution process often remains a black box, leaving users dependent solely on the agent’s self-reported updates. This opacity creates a critical risk: agents may present observer-facing reports that diverge from their executed actions, rendering the system uncontrollable, especially in high-stakes autonomous scenarios. We term such self-reported plan-action divergence as agent deception. To assess this, we introduce SPADE-Bench, a benchmark designed to evaluate spontaneous plan-action divergence. Unlike prior deception benchmarks, SPADE-Bench simultaneously integrates actual tool execution and controlled pressure scenarios. This design ensures ecological validity and rigorously distinguishes strategic deception from mere hallucination through controlled plan-action comparisons under pressure. Experiments across mainstream models confirm that agent deception is a genuine and pressing issue in tool-use contexts. By providing a comprehensive and robust evaluation framework, SPADE-Bench fills a critical gap in agent safety, facilitating the community’s progress toward building trustworthy and controllable autonomous systems.
[NLP-25] COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
【速读】: 该论文旨在解决语言智能体在动态交互环境中因世界模型(world model)静态化而难以适应代理自身策略演化所导致的决策偏差问题,以及现有代理优化方法依赖外部奖励或验证器所带来的现实适用性局限。其核心解决方案是提出一种闭环协同进化框架——COMAP(Co-evolving World Models and Agent Policies),通过让世界模型与代理策略在交互过程中持续共同演进:在每个决策步骤中,世界模型预测候选动作带来的未来状态反馈,代理则基于对反馈可靠性的评估进行前瞻性反思并优化行动;随后,利用生成的在线策略轨迹通过自蒸馏方式更新世界模型,使其逐步匹配代理的实际交互分布。这一机制显著提升了世界模型的预测准确性,并增强了长时程决策的有效性。实验表明,在具身任务规划、网页导航及工具使用等基准上,COMAP相较基线方法实现显著性能提升(如Qwen3-4B模型下相对提升16.75%),验证了其协同进化机制的有效性。
链接: https://arxiv.org/abs/2606.02372
作者: Youwei Liu,Jian Wang,Hanlin Wang,Wenjie Li
机构: Central South University; College of Computer Science, Sichuan University; Department of Computing, The Hong Kong Polytechnic University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent’s evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model’s prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: this https URL.
[NLP-26] Forget Attention: Importance-Aware Attention Is All You Need
【速读】: 该论文旨在解决混合语言建模中注意力机制(attention)与状态空间模型(SSM)之间协同不足的问题:传统Transformer虽具备全局信息检索能力却缺乏重要性优先级判断,而SSM虽能捕捉序列中的关键信息却无法回溯先前内容。现有混合架构(如Jamba和Hymba)将二者分置于不同层级(块级或头级),导致注意力计算过程中两者无法相互影响。其解决方案的关键在于提出SISA(SSM-Informed Softmax Attention),通过在注意力分数计算中直接引入由SSM推导出的重要性项,实现注意力与SSM的评分级融合(score-level fusion)。该方法将完整操作封装为一次标准的SDPA(Scaled Dot-Product Attention)调用,无需递归状态或自定义核函数,显著提升了推理效率与性能表现,在多个基准测试中均优于现有模型,从而确立了继块级与头级融合之外的第三种混合设计范式。
链接: https://arxiv.org/abs/2606.02332
作者: Soohyeong Shin,Yeongwook Yang
机构: Kangwon National University ( Kangwon 国立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 6 figures, 25 tables
Abstract:Combining attention’s global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids – Jamba (block level) and Hymba (head level) – place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors – no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer’s retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids – score-level fusion – beyond the block-level and head-level paradigms that have dominated the field.
[NLP-27] VIR: Building Deep Research Agents Towards Text–Visual Interleaved Report Generation
【速读】: 该论文旨在解决当前深度研究代理(Deep Research Agents)在多步骤信息检索、推理与长篇报告生成中,虽具备较强能力,但现有评估基准与系统仍以文本为中心,缺乏对视觉元素事实可靠性及与文本分析一致性评估的问题。其解决方案的关键在于提出TVIR(Text–Visual Interleaved Report Generation),包含两个核心组成部分:一是TVIR-Bench,一个由专家精心设计的100个跨模态深度研究任务集合,要求视觉元素服务于特定分析子目标;二是TVIR-Agent,一种分层多智能体框架,能够实现报告提纲构建、图像检索、可追溯来源的图表生成以及基于上下文感知的顺序写作,从而实现图文协同的报告生成。此外,论文还构建了融合文本评估与视觉评估的双路径评估框架,实验表明TVIR-Agent在九个深度研究系统中表现优异,凸显了在证据驱动的报告生成中显式进行跨模态设计与评估的重要性。
链接: https://arxiv.org/abs/2606.02320
作者: Xinkai Ma,Zhiqi Bai,Dingling Zhang,Pei Liu,Yishuo Yuan,He Zhu,Jiakai Wang,Qianqian Xie,Yifan Zhao,Xinlong Yang,Hao Cong,Zhiheng Yao,Fengxia Xie,Zihao Xu,Haoran Xu,Zhaohui Wang,Minghao Liu,Shirong Lin,Yingshui Tan,Yuchi Xu,Wenbo Su,Zhaoxiang Zhang,Bo Zheng,Jiaheng Liu
机构: Nanjing University (南京大学); Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text–Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.
[NLP-28] Unified Context Evolution for LLM Agents
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在完成多步骤交互任务时,因每次任务从固定上下文开始且任务结束后无法保留有效策略而导致的累积学习能力不足问题。现有方法或仅限于当前任务的学习,或简单地将所有经验存入无类型统一存储库,缺乏对知识类型的区分、使用质量的动态评估以及对知识库缺失状态的平衡优化。为此,论文提出一种无需梯度的统一上下文演化框架(Unified Context Evolution, UCE),将智能体的经验外部化为一个可演化的、分类型的可进化上下文单元(Evolvable Context Units, ECUs)库。其核心创新在于:将经验分解为四种互补类型——记忆(Memory)、策略(Strategy)、工作流(Workflow)和技能(Skill),每类通过特定条件生成,并在决策时检索、基于重复使用结果评分,不再有价值则被修剪;同时引入调度模块,根据知识库的薄弱环节动态分配生成预算。实验表明,UCE在ALFWorld和WebShop两个交互基准上分别将成功率从75.4%提升至96.3%,任务得分从45.1%提升至61.3%,且积累的知识库可直接迁移至其他智能体架构而无需重新训练,验证了其泛化与持续学习能力。
链接: https://arxiv.org/abs/2606.02304
作者: Zixuan Zhu,Yitong Hu,Yong Dai,Junfeng Fang,Chunyang Jiang,Senkang Hu,Yuzhi Zhao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:LLM-based agents can solve multi-step interactive tasks by combining reasoning with environment feedback, yet each episode starts from the same fixed context and any useful strategy discovered along the way is lost once the task ends. Existing approaches either limit learning to the current task or pool all experience into a single untyped store, without distinguishing knowledge types, tracking quality through use, or balancing what the library still lacks. We introduce Unified Context Evolution (UCE), a gradient-free framework that externalizes agent experience into an evolving library of typed Evolvable Context Units (ECUs). UCE decomposes experience into four complementary types (Memory, Strategy, Workflow, and Skill), each generated from trajectories under type-specific conditions, retrieved at decision time, scored through repeated usage outcomes, and pruned when no longer valuable. A scheduling module allocates each cycle’s generation budget toward the types where the library is weakest. Across two interactive benchmarks, UCE raises ALFWorld success from 75.4% to 96.3% and WebShop task score from 45.1% to 61.3%, and the accumulated library transfers to alternative actor backbones without retraining.
[NLP-29] Beyond Isolated Behaviors: Hierarchical User Modeling for LLM Personalization
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在个性化输出方面存在的核心挑战,即现有方法普遍采用扁平化的用户行为建模范式,未能显式捕捉用户行为背后深层次的组织结构。为此,论文基于皮埃尔·布迪厄(Pierre Bourdieu)的实践理论,提出PHF(Practice-Habitus-Field)框架,从社会学视角重构LLM个性化机制,其关键在于构建三个层次的层级化行为结构:个体行为作为“实践”(practice),其随时间积累形成的稳定倾向性构成“惯习”(habitus),以及相似用户间共享的行为规律形成“场域”(field)。通过该框架,论文进一步实现了\mathrmPHF_\textCompass——一种基于冻结大模型的轻量级、模型无关的实现方案,在语言模型个性化(LaMP)基准上验证了其在多样化任务中的一致性能提升,并通过深入分析证实了所学习行为结构具备良好的可解释性与可扩展性。
链接: https://arxiv.org/abs/2606.02300
作者: Liang Wang,Xinyi Mou,Xiaoyou Liu,Tiannan Wang,Yuqing Wang,Zhongyu Wei
机构: Fudan University (复旦大学); Shanghai Innovation Institute; OPPO
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet personalizing their outputs to individual users remains an open challenge. Existing approaches predominantly adopt a flat behavioral paradigm, aggregating user behaviors without an explicit account of how they are organized into deeper behavioral structures. In this work, we draw on Pierre Bourdieu’s Theory of Practice to propose PHF (Practice-Habitus-Field), a sociologically grounded framework that reconceptualizes LLM personalization through three hierarchical levels: individual behaviors as practices, their temporal accumulation into stable dispositions as habitus, and shared regularities across similar users as fields. We instantiate PHF through \mathrmPHF_\textCompass , a lightweight and model-agnostic implementation based on a frozen LLM. Experiments on the Language Model Personalization (LaMP) benchmark demonstrate consistent improvements across diverse tasks, while further analyses validate the interpretability and extensibility of the learned behavioral structures.
[NLP-30] AI as a Tool for Simulation-Based Experiments in Literary Studies
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在文学研究中难以可靠生成符合任意文化约束或风格特征的高质量、长篇叙事文本这一关键问题。尽管当前生成式AI系统在单句或短篇文本生成方面已取得进展,但在模拟复杂文学生产系统时仍缺乏对大规模、多轮、多智能体协作下文本连贯性与文化一致性控制的能力。其解决方案的关键在于整合多个相关领域的研究成果:包括将生成式AI作为可微分人类群体的代理模型进行验证;分析生成文本在叙事结构与风格上的特性;确保多智能体、多轮交互中行为的稳定性与一致性;以及发展可预测地调控生成系统知识与行为的技术方法。通过这些技术路径的融合,论文提出了一种基于仿真的实验框架,首次实现了在该领域内有限范围内的“分布内”输出,并通过对高声誉人类小说的对比实验验证了生成结果的有效性。这为未来构建完整的反事实文学史仿真系统奠定了基础。
链接: https://arxiv.org/abs/2606.02293
作者: Matthew Wilkens
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Generative artificial intelligence (AI) systems open new possibilities for experimentation in literary studies via controlled, grounded, large-scale, low-cost simulations of cultural production. Current systems have not yet been shown to produce high-quality, book-length narrative texts that reliably reflect arbitrarily specified cultural constraints or stylistic features. But there exists substantial relevant research on each of the components required for literary-historical simulation. These include the use and validation of AI systems as proxies for differentiable human populations; the narrative and stylistic properties of AI-generated texts; the stability and coherence of multiagent, multiturn AI simulations of human actors; and technical methods through which to alter in predictable ways the knowledge and behavior of generative systems. Together, these areas could provide a starting point for more ambitious AI-based modeling of cultural systems of literary production. We describe the possibilities and challenges of simulation-based experiments in literary studies, summarize the current state of the art in relevant fields, and explain key technical aspects of the work. To provide an example directly relevant to literary scholars, we present the results of experiments on literary text generation, including comparisons to high-status, human-authored novels. Our results include the first demonstration of (limited) in-distribution outputs by AI models in this domain. We conclude with a description of future work on full counterfactual literary-historical simulations using AI. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.02293 [cs.CL] (or arXiv:2606.02293v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.02293 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-31] DECK: A Consistency x Confidence Taxonomy of LLM Hallucinations
【速读】: 该论文旨在解决现有大语言模型(LLM)幻觉分类体系在实际应用中的一大局限:尽管已有分类方法能够诊断输出错误的类型(如记忆性误解、推理失败、流畅虚构等),但无法回答“哪种不确定性评分器(uncertainty scorer)能够有效检测此类错误”这一关键问题。为此,作者提出了一种互补性的新分类框架——DECK分类法,其核心是基于错误的可检测性特征(detectability signature)进行划分,即不同类型的错误在生成过程中会呈现出特定的信号模式,这些信号可被不同类别的评分器识别。DECK分类法将错误划分为四个行为模式:漂移(Drift)、固执(Entrenched)、虚构(Confabulation)和缠结(Knotted),分别对应于黑盒一致性评分器(对样本间一致性敏感)、白盒概率评分器(对词元级置信度敏感)以及具备独立预训练能力的“以LLM为裁判”机制(仅能检测固执型错误)。该分类通过Youden’s J统计指标在各评分轴上实现操作化定义,并在三个模型与四个数据集上通过双重验证:一是分析评分器对之间的分歧情况,二是检验外部标注(如SelfAware不可答标签、HaluEval对抗样本、PopQA实体流行度)是否落入预期的分类单元,且支持模型规模与内容特异性细分。研究进一步揭示了输出层面不确定性量化(UQ)的普遍盲区——当输入存在知识缺口时,生成器会输出自信且重复的虚构内容,导致所有输出级评分器家族均因设计缺陷而失效;此外,对Llama-3-8B隐藏状态的线性探测也退化至随机水平,初步表明此类失败可能已深入至激活层,提示需采用更丰富的内部状态方法(如不确定性头、信息论估计器)加以验证。
链接: https://arxiv.org/abs/2606.02289
作者: Mohit Singh Chauhan
机构: 未知
类目: Computation and Language (cs.CL)
备注: 18 pages, 3 figures, 5 tables
Abstract:Existing hallucination taxonomies classify LLM errors by what is wrong with the output – memorised misconceptions, reasoning failures, fluent fabrications. These taxonomies are useful for diagnosis but cannot answer a different question: which uncertainty scorer would have caught this error? We propose a complementary taxonomy that classifies errors by their detectability signature – the signal a scorer family would read. The DECK taxonomy is a 2x2 partition along inter-sample consistency and token-level confidence into four behavioural regimes (Drift, Entrenched, Confabulation, Knotted), each mapping to a specific scorer family (or families) that can detect it: black-box consistency scorers have signal in D and C, white-box token-probability scorers have signal in K and C, and only an LLM-as-a-Judge with independent pretraining can detect E. Cell membership is operationalised by a Youden’s J optimal split on each scorer axis. Across three models and four datasets we validate the taxonomy two ways: by analysing scorer-pair disagreement, and by checking that external labels (SelfAware unanswerable, HaluEval adversarial, PopQA entity popularity) land in the predicted DECK cells, with model-scale and content-specific secondary-cell refinements. We further identify a universal blind spot of output-level UQ: on knowledge-gap inputs where the generator emits confident, repeatable fabrications, every output-level family collapses by construction. A linear probe on Llama-3-8B’s hidden states also collapses to chance, giving preliminary evidence that the failure may persist at the activation level; richer internal-state methods (UQ heads, information-theoretic estimators) remain to be tested. Comments: 18 pages, 3 figures, 5 tables Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.02289 [cs.CL] (or arXiv:2606.02289v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.02289 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Mohit Singh Chauhan [view email] [v1] Mon, 1 Jun 2026 14:11:11 UTC (798 KB)
[NLP-32] Cross-modal linkage risk in clinical vision-language models
【速读】: 该论文旨在解决生成式医学影像-文本模型(如视觉-语言模型,VLMs)在临床数据共享场景中潜在的隐私泄露问题。具体而言,当胸部X光片与放射科报告在采集后被刻意分离(如仅共享图像或对报告访问受限),已训练的VLM仍可通过图像与报告嵌入向量间的余弦相似度实现高精度的“图像到报告”逆向检索,从而导致去标识化图像被重新关联至原始报告,构成严重的隐私风险。其核心问题是:尽管模型训练时使用的是配对数据,但其学习到的跨模态对齐能力在脱离训练环境后可能被滥用,造成敏感信息泄露。解决方案的关键在于:不通过重新训练模型来削弱对齐能力,而是冻结双模态编码器,仅对定义跨模态对齐的投影头(projection head)应用差分隐私(Differentially Private, DP)优化,在保证图像表征能力基本不变的前提下,显著降低跨模态重链接的可能性。实验表明,该方法在MIMIC-CXR和CheXpert Plus数据集上均有效,使召回率@1在候选池规模为10,000时下降61.8%,且图像侧任务性能(如14类疾病分类的宏平均AUCROC)仅轻微下降(从79.63%降至79.43%),证明了该策略在隐私保护与临床实用性之间实现了良好平衡。
链接: https://arxiv.org/abs/2606.02276
作者: Soroosh Tayebi Arasteh,Mahshad Lotfinia,Sven Nebelung,Daniel Truhn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
[NLP-33] Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)领域中人类标注报告不透明、不完整的问题,即多数研究在数据集构建与模型评估过程中缺乏对标注人员身份、标注流程控制等关键信息的充分披露。其解决方案的关键在于提出一个统一的标注报告实践分类体系,并开发基于大语言模型(Large Language Model, LLM)辅助的自动化提取管道,通过与人工审定的“Annotated-gold”黄金标准(涵盖41篇论文中的72个标注任务)进行对比验证,证明该方法可达到接近人类标注者的一致性(Krippendorff’s alpha为0.606,人类间一致性为0.585)。利用该管道构建了覆盖2018–2025年ACL系列会议论文的“Annotated-llm”数据集,共提取2,667个标注任务,揭示尽管研究普遍报告了标注人员招募策略、专业背景和标注规模等操作细节,但大量评估标注有效性所必需的信息(如培训流程、语言能力、报酬水平、社会人口学特征、争议解决机制及一致性度量值)仍被忽略,尤其在模型评估类研究中更为严重。研究结果表明,虽然近年来标注报告质量有所提升,但仍存在显著差异,由此提出一套可扩展的框架与最低限度报告建议,以增强人类标注过程的可靠性、可复现性与可解释性。
链接: https://arxiv.org/abs/2606.02255
作者: Maria Kunilovskaya,Gagan Bhatia,Lisa Sophie Albertelli,Yanran Chen,Christian Greisinger,Lotta Kiefer,Christoph Leiter,Subhadeep Roy,Tewodros Achamaleh,Muhammad Arslan Manzoor,Sebastian Pohl,Yufang Hou,Steffen Eger
机构: NLLG Lab University of Technology Nuremberg (NLLG 实验室,纽伦堡工业大学); Interdisciplinary Transformation University, Austria (跨学科转型大学,奥地利)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff’s alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
[NLP-34] ResMerge: Residual-based Spectral Merging of Large Language Models
【速读】: 该论文旨在解决基于强化学习(Reinforcement Learning, RL)训练得到的多个专家模型在进行模型融合时面临的挑战,特别是现有谱方法在假设主奇异方向包含主要任务信号、低能量残差成分可被压缩或抑制的前提下,无法有效处理RL任务向量的特性。研究发现,对于RL任务向量,其主成分(leading spectral head)与残差部分(residual component)均能独立恢复大量行为知识,但二者具有不同的融合特性:主成分信息集中且富含关键知识,但易引发跨专家间的剧烈冲突;而残差成分分布更分散,具备更强的聚合稳定性。针对这一问题,论文提出了一种基于残差的谱融合框架ResMerge,其核心在于:首先通过球面残差一致性适应(Spherical Residual Consensus Adaptation)构建一个稳定的残差骨干,估计在Frobenius球面上的可靠性加权共识方向;随后利用一个轻量级头部修正模块,基于正向跨专家一致性的门控机制重新引入主成分信息。实验结果表明,ResMerge在多个RL专家组和能力领域中,相较于代表性任务向量与谱融合基线方法,能更有效地保留专家能力。
链接: https://arxiv.org/abs/2606.02252
作者: Yandu Sun,Zhiyan Hou,Haokai Ma,Yuheng Jia,Junfeng Fang,Haiyun Guo,Hongyan An,weizhen wang,Jinqiao Wang
机构: Southeast University (东南大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); National University of Singapore (新加坡国立大学); Wuhan University of Technology (武汉理工大学); Peking University (北京大学); Wuhan AI Research (武汉人工智能研究院)
类目: Computation and Language (cs.CL)
备注: 14 pages including appendix
Abstract:Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at this https URL.
[NLP-35] Geometric Latent Reasoning Induces Shorter Generations in LLM s
【速读】: 该论文旨在解决大语言模型在复杂问题求解中依赖长序列显式推理(Chain-of-Thought, CoT)所带来的计算开销大、长度敏感及受限于离散自然语言表达的问题。其核心挑战在于如何设计有效的中间隐式状态结构以支持更高效的推理过程。解决方案的关键在于将隐式推理建模为预训练词嵌入空间中的几何路径逼近问题,提出几何隐式推理(Geometric Latent Reasoning, GLR)。GLR通过轻量级的过渡头在嵌入空间中预测迭代方向更新,利用文本形式的思维链作为锚点,学习近似离散推理轨迹的同时允许对精确词嵌入的连续偏离。实验结果表明,在Qwen3模型上,基于数学推理基准的评估揭示了一种涌现现象:几何隐式推理可在无显式长度约束的情况下显著缩短生成长度,通过用连续隐式步骤替代早期显式推理,模型通常以更少的总生成步数达到正确答案。这表明连续路径可作为紧凑的中间推理状态,揭示了隐式计算预算、输出长度与准确率之间的新型权衡关系。
链接: https://arxiv.org/abs/2606.02248
作者: Shashi Kumar,Yacouba Kaloga,Petr Motlicek,Ina Kodrasi,Andrea Cavallaro
机构: Idiap Research Institute (Idiap 研究所); EPFL (瑞士联邦理工学院); BUT (布尔诺技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, determining useful structures for intermediate latent states is an open challenge. In this paper, we formulate latent reasoning as a geometric path-approximation problem within the model’s pretrained token-embedding space. We introduce Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space. Using textual chain-of-thought traces as anchors, GLR learns to approximate discrete reasoning trajectories while permitting continuous deviations from exact token embeddings. Evaluations on mathematical reasoning benchmarks using Qwen3 models reveal an emergent phenomenon: geometric latent reasoning induces substantially shorter generations without an explicit length objective. By replacing early explicit reasoning with continuous latent steps, models often reach correct answers using substantially fewer total generation steps. These findings suggest that continuous trajectories act as compact intermediate reasoning states, exposing a new tradeoff between latent computation budget, output length, and accuracy.
[NLP-36] When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation
【速读】: 该论文旨在解决生成式检索增强生成(Retrieval-Augmented Generation, RAG)系统在现实应用中面临的核心问题:现有研究普遍假设外部知识是免费且无摩擦获取的,但实际中许多高质量信息源受版权保护、需付费访问或存在其他访问限制,导致知识获取存在显著成本。为应对这一挑战,论文提出“成本感知型RAG”(cost-aware RAG)框架,将检索到的证据按访问成本划分为不同层级,并在明确的证据获取预算约束下进行问答任务。其关键创新在于构建了一个带有访问摩擦层级的MS MARCO v2.1数据集,并在通用领域与特定领域问答基准上评估预算受限下的证据选择策略。研究发现,静态证据选择方法表现脆弱:不存在一种固定的选择器能始终优于其他方法,且增加预算并不保证答案质量提升,即使高成本证据与任务领域高度匹配。为此,论文进一步探索了基于智能体(agentic)的成本感知RAG,即由大语言模型(LLM)自主决策何时检索、选择何种访问层级以及何时终止检索。实验表明,此类智能体作为自适应证据获取控制器展现出良好潜力,但其行为仍高度依赖于具体模型和任务特性。综上,该研究揭示了成本感知的证据获取是下一代RAG系统亟待突破的关键挑战。
链接: https://arxiv.org/abs/2606.02245
作者: Mingyan Wu,Han Yang,Omer Ben-Porat,Yftah Ziser
机构: Northeastern University(东北大学); Technical University of Munich(慕尼黑工业大学); GESIS – Leibniz Institute for the Social Sciences(德国社会科学研究机构); Technion–Israel Institute of Technology(以色列理工学院); NVIDIA Research(英伟达研究部门); University of Groningen(格罗宁根大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) typically assumes that external knowledge is free, but many high-quality sources are paywalled, licensed, restricted, or otherwise costly to access. We introduce cost-aware RAG, a setting where retrieved evidence is assigned access-cost tiers and systems must answer under an explicit evidence-access budget. We instantiate this setting by augmenting MS MARCO v2.1 with access-friction tiers and evaluate budgeted evidence selection across general-domain and domain-specific QA benchmarks. Our results show that static selection is brittle: no fixed selector uniformly dominates, and larger budgets do not reliably improve answer quality, even when costly evidence is domain-matched. We then study agentic cost-aware RAG, where an LLM decides when to retrieve, which tier to access, and when to stop. Agents show strong promise as adaptive evidence-acquisition controllers, but their behavior remains highly model- and task-dependent. These findings suggest that cost-aware evidence acquisition is a central challenge for the next generation of RAG systems. All code and data are available at this https URL.
[NLP-37] Agent RedBench: Dynamic Redteaming and Integration-Aware Defense for LLM Agents over SaaS Integrations
【速读】: 该论文旨在解决生成式 AI 代理(LLM agents)在使用第三方工具集成(如 Gmail、Salesforce、Jira 等)时面临的间接提示注入(indirect prompt injection)威胁,此类攻击利用用户无法控制或编写但代理需读取的外部响应内容实施恶意指令。现有基准测试存在严重低估风险的问题:多数仅覆盖少量集成且重复使用相同攻击载荷,而开源防御机制则基于对话式数据训练,未能适配工具响应内容特性。为此,本文提出 AGENTREDBENCH,一个由大语言模型驱动的动态红队测试基准,涵盖 24 个企业级集成、九类功能模块和五种攻击类型,共 215 个微妙且授权边界模糊的攻击场景。在包含八款主流模型(Anthropic、OpenAI、Google)的测试面板中,无防护条件下攻击成功率(ASR)介于 32%(Claude Sonnet 4.6)至 81%(Gemini 3 Flash)之间。为确保基准长期有效性与免于训练数据污染,研究团队公开发布代码库、集成模式及不可变版本化的评估通道,并同步推出 AGENTREDGUARD——一个基于多样化对抗性工具响应数据训练的专用防御模型。实验表明,AGENTREDGUARD 将整体面板 ASR 从 69.9% 降至 2.4%,误报率仅为 0.37%,显著优于所有非平凡检测的开源基线(Llama Guard、PromptGuard 2、ProtectAI),且跨集成与跨攻击类型的留出测试验证了其泛化能力。解决方案的关键在于构建真实、多样、动态的对抗性工具响应数据集并据此训练具备强泛化能力的专用防御模型。
链接: https://arxiv.org/abs/2606.02240
作者: Hiskias Dingeto,Will Leeney
机构: StackOne
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET)
备注:
Abstract:Indirect prompt injection in tool-use agents is a concrete production threat: LLM agents read from integrations (third-party services such as Gmail, Salesforce, or Jira accessed through tool calls) whose response content the user neither writes nor controls. Existing benchmarks under-measure the threat: most cover only a handful of integrations with the same attack payload replayed across runs, and open-source guards are trained on chat-style data rather than tool-response content. We introduce AGENTREDBENCH, a dynamic LLM-driven redteaming benchmark of 215 subtle underspecified authorization (attacks at the boundary of what the user’s request authorises) scenarios across 24 enterprise integrations in nine functional families and five attack types. Across an eight-model panel (Anthropic, OpenAI, Google), no-guard ASR (attack success rate) ranges from 32% (Claude Sonnet 4.6) to 81% (Gemini 3 Flash). To keep the scenario set out of training corpora and preserve headline ASR meaning over time, we release the codebase, integration schemas, and AGENTREDGUARD model openly; the canonical scenarios are evaluated through a maintainer-mediated channel with immutable versioning. We release AGENTREDGUARD alongside the benchmark: a guard trained on an integration-diverse corpus of adversarial tool-response content. AGENTREDGUARD cuts panel ASR from 69.9% to 2.4% at 0.37% false-positive rate, outperforming every open-source baseline with non-trivial detection (Llama Guard, PromptGuard 2, ProtectAI) on both axes. Cross-integration and cross-attack type holdouts both confirm the gain transfers beyond the training subset.
[NLP-38] Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes
【速读】: 该论文旨在解决生成式AI在社交媒体健康谣言纠错中因每次任务独立而无法复用历史纠错经验的问题,导致效率低下且知识难以积累。其核心解决方案是提出EvoNote——一种具备自进化能力的智能体框架,通过细粒度的贡献归因机制,将过往纠错案例中的轨迹级反馈(如注释质量)转化为可操作的动作级经验记忆,用于指导后续的论点分析、证据获取与注释撰写。该框架在包含1200个实例的多模态基准MM-HealthCN上验证,结果显示,在人类评估的分层效用判别下,EvoNote生成的社区注释在89.6%的情况下优于人工撰写版本;对于缺乏群体帮助性评分的“需更多评价”帖子,其仍能为82.0%的案例生成有效注释,同时将单条修正建议的生成时间从人工流程的13小时以上压缩至2分钟以内。分析表明,性能提升源于更优的证据使用和可复用的纠错策略,证明了自演化注释生成在健康谣言治理中的巨大潜力。
链接: https://arxiv.org/abs/2606.02215
作者: Zihang Fu,Fanxiao Li,Jianyang Gu,Haonan Wang,Preslav Nakov,Bryan Hooi,Min-Yen Kan,Jiaying Wu
机构: National University of Singapore(新加坡国立大学); Yunnan University(云南大学); The Ohio State University(俄亥俄州立大学); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注:
Abstract:Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.
[NLP-39] Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在价值敏感型决策场景中对无关人口统计学线索(如性别)的敏感性问题,即在应保持决策不变的前提下,模型是否仍会因角色性别配置的变化而产生系统性判断偏差。其核心解决方案的关键在于构建一个受控的基准测试框架——真实价值决策基准(Realistic Value Decision Benchmark, RVDB),该框架在固定情景、价值对排序、角色身份、候选决策、价值距离及决策严重性等变量的基础上,仅改变角色的性别配置,从而精准评估模型在性别扰动下的决策不变性(decision invariance)。研究发现,尽管模型在显式性别提示下表现出有限但系统的决策反转,且在要求其自述性别是否影响判断时,多数模型仍归因于“无影响”或其他非性别因素,反映出自我归因与实际行为之间存在显著脱节。进一步分析表明,性别影响主要集中在价值判断边界模糊区域及高严重性决策情境中,提示性别线索更可能作为局部边界调整因子而非全局价值推理的替代机制。这一发现揭示了性别因素可隐蔽地介入模型的价值权衡过程,而模型自身却难以察觉或承认,因而强调需开展超越解释性评估的、基于行为观测的受控审计,以提升模型在伦理敏感场景中的可靠性与公平性。
链接: https://arxiv.org/abs/2606.02214
作者: Yangyang Liu,Dong Yu,Pengyuan Liu
机构: Beijing Language and Culture University; OpenAI; Alibaba Cloud; Zhipu AI
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models are increasingly used in value-sensitive decision settings, where irrelevant demographic cues should not alter judgments. We construct the Realistic Value Decision Benchmark (RVDB), a controlled benchmark that varies only the role-gender configuration while holding the scenario, ordered value pair, roles, candidate decisions, Value Distance, and Decision Severity fixed. Using a position-balanced evaluation across seven models, we test whether models preserve decision invariance under gender perturbations and whether their self-attributions reflect observed behavioral changes. We find that explicit gender cues induce bounded but systematic decision flips, including under an explicit gender-attribution prompt that asks models to report whether gender influenced their choice. Cross-gender role swaps reveal a consistent female-proposed-decision asymmetry, while models often attribute flipped decisions to No Influence or other non-gender factors. Further analysis shows that gender effects concentrate near less determinate value boundaries and under more severe decision contexts, suggesting that gender cues act as local boundary-shifting factors rather than global overrides of value reasoning. Value rankings remain largely stable, but ordered value-pair trade-offs shift unevenly across role-gender configurations. These results show that gender can enter LLM value trade-offs behaviorally while remaining obscured in self-attribution, motivating controlled behavioral audits beyond explanation-based evaluation.
[NLP-40] Consistency Training while Mitigating Obfuscation via Rate Matching
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中受外部无关输入特征(如暗示用户偏好答案的提示)干扰的问题,这类干扰可能导致模型产生偏见或不一致的行为。现有的一致性训练(Consistency Training)方法通过在包含与不包含干扰特征的输入对上强制模型输出或内部激活保持一致来缓解此问题,但其局限在于会约束模型表达特定行为的方式,导致“混淆”(obfuscation)现象——即模型虽不再显式提及干扰线索,却仍受其影响,从而降低可监控性(monitorability)。为克服这一缺陷,本文提出速率匹配一致性训练(Rate Matching Consistency Training, RMCT),其核心创新在于不强制模型在不同输入下输出完全一致,而是仅要求模型在不同扰动输入下表现出目标行为(如响应偏见线索)的速率保持一致。这种方法无需依赖成对的含/不含干扰特征输入,因此可应用于无法移除干扰特征的实际场景。实验表明,RMCT在两个开源大模型上的反谄媚(sycophancy)减少任务中表现优异,对未见偏见类型的响应偏差降低效果接近传统一致性训练基线,同时显著保留了模型对偏见线索的显式提及能力,从而兼顾行为鲁棒性与可监控性。此外,尽管RMCT在数据效率方面更优,但在计算效率上略逊于基线方法。总体而言,本研究证明了通过非约束性行为速率匹配,一致性训练可在不牺牲可监控性的前提下提升模型行为的稳健性。
链接: https://arxiv.org/abs/2606.02211
作者: Sohaib Imran,Prakhar Gupta,Jannes Elstner,David Demitri Africa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are often influenced by extraneous input features, such as cues revealing a user’s preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model’s tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.
[NLP-41] Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents
【速读】: 该论文旨在解决大语言模型代理(LLM agents)在文本基准测试中虽表现优异但推理成本过高,而现有轻量级神经重排序器(neural rerankers)通常需为每个环境单独训练与维护的问题。核心挑战在于如何构建一个通用性强、可跨多个异构环境(如ALFWorld、WebShop、ScienceWorld)执行动作选择的轻量级统一模型,从而避免针对不同环境分别部署和维护专用模型的开销。其解决方案的关键在于通过联合训练(joint training)三类环境数据,并采用少数类样本上采样(minority-class upsampling)实现数据分布均衡,显著提升了模型在多环境下的泛化能力;实验表明,三环境联合训练在保持竞争性单环境性能的同时,实现了平均+0.551的净收益,且具备高度样本效率——仅使用9.2%的目标域数据即可恢复93%的全量数据性能,凸显数据多样性是提升跨域迁移能力的核心因素。此外,引入环境感知的LoRA适配器路由(environment-aware LoRA adapter routing)结合PCGrad优化策略,在最优种子下达到+0.611的性能增益,尽管存在较高的随机性波动,但仍展现出极具潜力的跨环境泛化方向。整体而言,数据清洗与再平衡的联合训练范式是实现高性能、低成本、可扩展的通用动作选择模型的关键。
链接: https://arxiv.org/abs/2606.02204
作者: Kan Shao
机构: Jinglue Technology Development (Nanjing) Co., Ltd.(南京景略科技发展有限公司)
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures, 6 tables
Abstract:Large language model agents achieve strong performance on text-based benchmarks but incur prohibitive inference costs, motivating the use of compact neural rerankers for action selection. We investigate whether a single lightweight model can perform action selection across multiple diverse environments, a capability that would eliminate per-environment model maintenance. Training DeBERTa-v3 (184M-434M parameters) jointly on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling, we find that rebalanced two-environment joint training substantially improves over single-environment ALFWorld performance (net gain +0.412) while maintaining competitive WebShop performance (+0.214 vs. +0.249 single-environment). Three-environment training yields a mean combined net gain of +0.551 +/- 0.024 across 4 seeds, with per-environment results approaching specialized single-environment models while providing positive cross-domain transfer. Cross-environment adaptation is highly sample-efficient: fine-tuning on only 9.2% of target-domain data recovers 93% of full-data performance, and scaling model capacity yields limited benefits, indicating data diversity is the primary driver. Environment-aware LoRA adapter routing with PCGrad achieves a best-seed result of +0.611 (seed 42), with seeds 456 and 789 at +0.554 and +0.559, but exhibits high variance due to seed 123 collapsing to +0.263 (4-seed mean +0.497 +/- 0.158), representing a promising but currently unstable direction. Joint training with clean splits and data rebalancing is a key ingredient. We will release our three-environment benchmark of 51,580 training instances (41,740 raw unique states with minority-class upsampling) and all model checkpoints upon acceptance.
[NLP-42] CRAFTQA: A Code-Driven Adaptive Framework for Complex Structured Data Reasoning ACL2026
【速读】: 该论文旨在解决统一结构化数据问答(Unified Structured Data Question Answering)中现有方法依赖预定义函数集所导致的复杂推理能力受限问题。其核心挑战在于,传统方法无法灵活应对超出预设操作范围的复杂逻辑推理任务。为克服这一局限,论文提出CRAFTQA框架,其关键创新在于构建了一个自适应的代码驱动机制:通过CodeSTEP模块生成完整的可执行Python代码序列,实现基于步骤的代码化推理;同时引入CRAFT模块,动态生成针对非预定义操作的定制化代码函数,并与CodeSTEP无缝集成,显著提升了系统在复杂推理场景下的灵活性与泛化能力。实验结果表明,该框架在多个结构化数据集上均显著优于现有统一方法。
链接: https://arxiv.org/abs/2606.02170
作者: Chengtao Gan,Zhiqiang Liu,Long Jin,Yushan Zhu,Lei Liang,Wen Zhang
机构: Zhejiang University (浙江大学); Ant Group (蚂蚁集团); JIUTIAN Research, Beijing, China (九天研究院,北京,中国); ZJU-Ant Group Joint Lab of Knowledge Graph (浙江大学-蚂蚁集团知识图谱联合实验室)
类目: Computation and Language (cs.CL)
备注: Accepted by Findings of ACL 2026
Abstract:Real-world scenarios involve massive heterogeneous structured data (e.g., tables, knowledge graphs), making effective reasoning over such diverse data increasingly important. Unified structured data question answering has emerged as a prominent research trend, aiming to answer natural language questions across different structured data types within a single framework. However, existing unified methods share a common limitation: they rely on a set of predefined functions, which restricts their ability to perform complex reasoning beyond these predefined operations. To overcome this fundamental limitation, we propose CRAFTQA, a novel adaptive code-driven framework comprising two core modules, CodeSTEP and CRAFT. The CodeSTEP module is a paradigm that generates a complete executable Python code sequence, which contains step-by-step code-based reasoning operations based on the question. The CRAFT module dynamically generates custom code functions for operations beyond the predefined function set, and seamlessly integrates with CodeSTEP to significantly enhance flexibility in handling complex reasoning. Comprehensive experiments on multiple structured datasets demonstrate that CRAFTQA achieves remarkable improvements in complex reasoning scenarios compared to existing unified methods.
[NLP-43] InfoMerge: Information-aware Token Compression for Efficient Video Large Language Models
【速读】: 该论文旨在解决视频大语言模型(Video-LLM)在视频理解任务中因视觉标记(visual tokens)数量过多而导致的高计算开销问题。现有无训练压缩方法虽能通过减少视觉标记提升推理效率,但其依赖局部相邻帧相似性进行时序冗余估计,或仅根据片段长度分配标记预算,易受帧级噪声干扰,且无法有效捕捉真实视频中非均匀的信息分布特性。为此,本文提出一种无需训练的视觉标记压缩方法InfoMerge,其核心在于通过鲁棒的冗余估计与内容感知的预算分配机制实现更高效的标记利用。关键创新包括:提出时序指纹差异(Temporal Fingerprint Difference),一种基于段内相同空间位置上标记时序结构的二阶冗余估计策略,以更准确地建模跨时间维度的冗余关系;引入内容感知预算分配(Content-Aware Budget Allocation, CABA),依据片段独特性及基于谱熵的表征丰富度动态分配段级标记预算,从而优先保留信息丰富的区域并抑制对静态冗余区域的重复编码。实验表明,InfoMerge在多个基准和骨干网络上均实现了优异的效率-精度权衡,在极端压缩条件下优势更为显著:以LLaVA-OneVision-7B为例,仅保留85%的视觉标记即维持98.8%的原始平均性能,并在预填充阶段实现4.24倍加速。
链接: https://arxiv.org/abs/2606.02161
作者: Xinxin Liu,Shiwei Gan,Xiao Liu,Yafeng Yin,Lei Xie,Sanglu Lu
机构: State Key Laboratory of Novel Software Technology, Nanjing University (南京大学新型软件技术国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 15 pages, 8 figures
Abstract:Video Large Language Models (Video-LLMs) achieve strong performance in video understanding, but their excessive visual tokens bring substantial computational overhead. Existing training-free compression methods improve inference efficiency by reducing visual tokens, yet they often rely on local adjacent-frame similarity for temporal redundancy estimation or allocate token budgets mainly according to segment length. Such designs are sensitive to frame-level noise and fail to capture the non-uniform information distribution of real-world videos. To address these challenges, we propose InfoMerge, a training-free visual token compression method that improves token utilization through robust redundancy estimation and content-aware budget allocation. Specifically, we propose the Temporal Fingerprint Difference: a segment-level second-order temporal redundancy estimation strategy, which models the temporal similarity structure of tokens at the same spatial positions within each segment. We further introduce Content-Aware Budget Allocation (CABA), which dynamically allocates segment-level token budgets based on segment uniqueness and spectral-entropy-based representational richness. By reducing repeated preservation of redundant static regions and allocating more tokens to informative segments, InfoMerge makes better use of the limited token budget while maintaining strong performance. Extensive experiments show that InfoMerge achieves strong efficiency–accuracy trade-offs across multiple benchmarks and backbones, with more pronounced advantages under aggressive compression. On LLaVA-OneVision-7B, InfoMerge retains 98.8% of the original average performance while reducing 85% of visual tokens and achieving a 4.24-fold speedup in the prefill stage.
[NLP-44] On the Salience of Low-Probability Tokens for AI-Generated Text Detection: A Multiscale Uncertainty Perspective ICML2026
【速读】: 该论文旨在解决生成式人工智能(Generative AI)文本与人类写作日益融合所带来的识别难题,尤其关注由此引发的虚假信息传播、学术滥用及语料污染等实际风险。现有基于统计的检测方法虽具备高效性和泛化能力,但存在两大关键缺陷:其一,模板化内容(boilerplate)主导,即在人类与大语言模型(LLM)文本中广泛共享的高频率词汇掩盖了具有判别意义的信号;其二,脆弱的点估计,依赖单一概率分数的决策在对抗性篡改下表现不稳定。为克服上述问题,本文提出Uncertainty——一种多尺度不确定性估计算法,其核心在于聚焦于低概率但信息量高的词元(tokens),从而更清晰地揭示分布差异。局部层面,通过平均低概率词元的对数概率来缓解模板化内容的干扰;全局层面,则利用Rényi熵捕捉低概率区域的概率分布形态,以增强鲁棒性。进一步地,通过条件独立采样机制扩展为Uncertainty++,实现了更为稳定的不确定性估计。在七个数据集和十六个LLM上的实验验证了该方法在有效性、泛化性与抗攻击性方面的显著优势。
链接: https://arxiv.org/abs/2606.02158
作者: Yikai Guo,Bin Wang,Xilai Fan,Wenjun Ke,Haoran Luo
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2026 main conference
Abstract:AI-generated text increasingly blends with human writing, raising practical risks such as misinformation, academic misuse, and corpora contamination. While statistical detectors are appealing for efficiency and generalization, they suffer from two key limitations. (i) Boilerplate dominance, boilerplate tokens shared across human and LLM writing can overwhelm discriminative signals. (ii) Brittle point estimates, relying on a single probability score yields unstable decisions under adversarial manipulations. To address these issues, we propose Uncertainty, a multiscale uncertainty estimator that focuses on informative low-probability tokens, which more clearly expose distributional discrepancies. Locally, it alleviates boilerplate dominance by averaging the log-probabilities of low-probability tokens; globally, it reduces brittleness by capturing the distributional shape of this low-probability region via Rényi entropy. We further extend the detector to Uncertainty++ via conditional independent sampling, yielding a more stable uncertainty estimation. Experiments across seven datasets and sixteen LLMs demonstrate high effectiveness, generalization, and robustness. Our code is available at this https URL.
[NLP-45] Multilingual Idioms in Sentences and Conversations Across High- Medium- and Low-Resource Languages
【速读】: 该论文旨在解决多语言自然语言处理(Multilingual NLP)中习语理解的难题,尤其关注习语在字面义与隐喻义之间意义转换所引发的语境依赖性问题。现有研究多集中于高资源语言,且仅评估孤立的习语-语义匹配任务,忽视了真实语篇中的上下文复杂性。为此,论文提出了MIDI数据集,涵盖3种高资源、3种中等资源及12种低资源语言,由母语者精心构建,包含嵌入句级和对话语境中的习语实例,能够同时捕捉习语的字面义与隐喻义。实验表明,当前主流模型在低资源语言上的习语理解性能显著下降,且所有资源层级下,字面义理解均比隐喻义更困难;尽管对话上下文有助于提升表现,但无法消除这一差距。通过控制实验与对隐藏表示的干预分析,研究进一步区分了模型的记忆化行为与推理能力,揭示了现有模型在语义泛化与深层语境理解方面的根本局限。
链接: https://arxiv.org/abs/2606.02147
作者: Saeed Almheiri,Bilal Elbouardi,Salsabila Zahirah Pranida,Irina Nikishina,Ashwath Rao B,Parameswari Krishnamurthy,Muhammad Cendekia Airlangga,Rifo Ahmad Genadi,Nguyen Phan Gia Bao,Amir Hossein Yari,Hawau Olamide Toyin,Nurdaulet Mukhituly,Mena Attia,Besher Hassan,Ahmad Fathan Hidayatullah,Tatsuki Kuribayashi,Haonan Li,Suma Bhat,Fajri Koto
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); University of Hamburg (汉堡大学); Manipal University (曼帕尔大学); IIIT Hyderabad (印度国际信息技术研究所海得拉巴分校); University of Science and Technology of Hanoi (河内科学技术大学); Universitas Islam Indonesia (印尼伊斯兰大学); Princeton University (普林斯顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Idiomatic expressions pose a major challenge for multilingual NLP because their meanings shift between figurative and literal usage, often requiring context for accurate interpretation. Prior work has focused on high-resource languages typically evaluates isolated idiom-meaning questions, overlooking realistic discourse. We introduce MIDI, a multilingual idiom dataset spanning 3 high-, 3 medium-, and 12 low-resource languages, curated by native speakers. Unlike previous datasets, MIDI provides idioms embedded in both sentence-level and conversational contexts, capturing both literal and figurative readings. Benchmarking state-of-the-art models shows that idiom comprehension degrades in low-resource languages and that, in all resource tiers, literal interpretations are substantially harder than figurative ones. Conversational context improves performance but does not eliminate these disparities. Through controlled tests and interventions on hidden representations, we further separate memorization from reasoning, exposing core limitations of current models.
[NLP-46] A Primer in Post-Training Reasoning Data: What We Know About How It Works
【速读】: 该论文旨在解决大模型后训练(post-training)阶段中推理数据(reasoning data)研究碎片化的问题,系统性地整合了超过150篇公开的研究文献与系统报告,以厘清后训练推理数据的现状与发展方向。其核心挑战在于:尽管推理数据是决定后训练成功与否的关键变量,但相关研究分散于数据集论文、强化学习方案、奖励模型研究、评估基准及前沿系统报告之中,缺乏统一的归纳与框架。本文的关键解决方案是构建一个四维组织框架,围绕“存在哪些数据对象”、“何种特性使其有效”、“如何构建”以及“如何实现规模化”四个核心问题,对现有研究进行系统梳理与整合,从而为未来推理数据的发布和后训练方法的设计提供可追溯、可复现的归因范式(attribution framework)。
链接: https://arxiv.org/abs/2606.02113
作者: Yaoming Li,Guangxiang Zhao,Qilong Shi,Lin Sun,Xiangzheng Zhang,Tong Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 22 pages. Project Repository: this https URL
Abstract:Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet this literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key public studies and system reports on post-training reasoning data. We organize the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. Together, this organization provides an attribution framework for future reasoning-data releases and post-training recipes.
[NLP-47] Jailbreaking Multimodal Large Language Models using Multi-Clip Video ACL2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理视频输入时可能遭受恶意攻击的安全隐患问题,特别是针对通过视觉输入绕过安全对齐机制的漏洞。现有研究已表明,图像输入可被用于“越狱”(jailbreak),但尚不清楚视频输入中哪些具体属性会诱发此类脆弱性。为此,作者提出了Multi-Clip Video (MCV) SafetyBench数据集,包含2,920个视频,每个视频由多个短片段组成,涵盖与有害查询相关的多样化语境,以系统评估视频输入多样性对MLLMs脆弱性的影响。实验结果表明,随着视频中片段数量的增加,攻击成功率显著上升;同时发现,相较于静态图像,视频模态更具脆弱性,动态视频比静态视频更易被利用,且包含更多样化语境的视频威胁程度更高。基于上述发现,论文提出一种防御策略,其核心在于利用图像模态相对更高的鲁棒性,通过融合或优先依赖图像特征来增强模型整体安全性。
链接: https://arxiv.org/abs/2606.02111
作者: Choongwon Kang,Seungjong Sun,Hyunmin Jun,Jang Hyun Kim
机构: Sungkyunkwan University(成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 27 pages, 20 figures, Accepted to the Main Conference of ACL 2026
Abstract:As multimodal large language models (MLLMs) have advanced to process video inputs, concerns have emerged about their potential for malicious misuse. Prior jailbreak studies have shown that safety alignment in MLLMs can be bypassed through visual inputs, yet it remains unclear which properties of video inputs induce this vulnerability. To address this gap, we introduce Multi-Clip Video (MCV) SafetyBench, a dataset of 2,920 videos designed to evaluate how the diversity of video inputs affects the vulnerability of MLLMs. Each video consists of multiple short clips depicting diverse contexts related to a harmful query. Experiments on eight representative video MLLMs show that attack success consistently increases with the number of clips. Our results further indicate that the video modality is (1) more vulnerable than the image modality, (2) more vulnerable to dynamic videos than to static videos, and (3) more vulnerable when videos contain more diverse contexts. Building on these findings, we propose a defense strategy that leverages the relative robustness of the image modality.
[NLP-48] PortBERT: Navigating the Depths of Portuguese Language Models
【速读】: 该论文旨在解决葡萄牙语自然语言处理(Natural Language Processing, NLP)领域中高效且语言特定的模型稀缺的问题,尤其针对现有研究过度关注模型规模或准确率而忽视训练与部署效率的现状。其核心解决方案在于提出PortBERT,一个基于RoBERTa架构、专为葡萄牙语设计的语言模型家族,通过在超过450 GB的去重和过滤后的mC4与OSCAR23数据集(来自CulturaX)上从头训练,并采用字节级子词分词(Byte-level BPE)与跨GPU和TPU平台稳定的预训练流程,在性能与计算效率之间实现良好平衡。该工作不仅在ExtraGLUE基准上验证了模型的竞争力,还系统报告了训练时间、推理延迟及微调吞吐量等关键效率指标,从而填补了葡萄牙语NLP中关于计算-性能权衡(compute-performance tradeoffs)研究的空白。所有模型均已开源至Hugging Face,并提供fairseq检查点以支持后续研究与应用。
链接: https://arxiv.org/abs/2606.02100
作者: Raphael Scheible-Schmitt,Henry He,Armando B. Mendes
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.
[NLP-49] he Role of Ambiguity in Error Prediction via Uncertainty Quantification
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在错误预测中因输入不确定性(aleatoric uncertainty)干扰而导致的性能下降问题。现有基于不确定性量化(Uncertainty Quantification, UQ)的方法虽能捕捉模型知识或能力不足时的置信度缺失,但其信号同时受到输入本身固有模糊性的影响,从而削弱了对真实错误的预测能力。为此,论文提出一种关键解决方案:通过解耦输入模糊性与UQ信号,提升错误预测的准确性。具体而言,采用门控专家(Gated Experts)与选择性预测(Selective Prediction)机制,将真实标签和预测的模糊性信息引入错误预测流程。实验结果表明,该方法显著提升了多种模型架构、训练与评估范式、数据集及不同来源的随机性不确定性下的错误预测性能,在标准数据集上使单个UQ指标的正确召回率(PRR)提升超过10个百分点,证明了模糊性信息对增强错误预测的有效性。
链接: https://arxiv.org/abs/2606.02093
作者: Ieva Raminta Staliūnaitė,James Bishop,Andreas Vlachos
机构: University of Cambridge (剑桥大学); The Alan Turing Institute (艾伦·图灵研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages not including references and appendices, 3 figures
Abstract:The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.
[NLP-50] DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding
【速读】: 该论文旨在解决生成式 AI(Generative AI)在大语言模型(LLM)推理过程中因块扩散推测解码(Block Diffusion Speculative Decoding)所面临的效率瓶颈问题,即现有方法DFlash受限于所有草稿层共享单一融合表示,导致每层表达能力不足,难以进一步扩展草稿模型容量。其解决方案的关键在于提出一种轻量级的逐层融合机制——\modelname,通过让每个草稿层独立关注一组广泛的目标模型层的可学习组合,突破了DFlash中狭窄的条件信息瓶颈。该机制在几乎无额外计算开销的前提下,显著增强了每层对目标模型内部知识的注入能力,并赋予各草稿层差异化输入,从而提升了整体表达能力。结合训练数据规模从80万增至240万样本,实现了更深层次草稿模型的稳定性能提升。在涵盖数学推理、代码生成与对话任务的六个基准测试中,\modelname 在Qwen3-4B、Qwen3-8B 和 GPT-OSS-20B 上分别实现了平均5.52倍、5.46倍和3.91倍的时钟速度加速,相较DFlash分别提升约11%、8%和5%。
链接: https://arxiv.org/abs/2606.02091
作者: Jiebin Zhang,Zhenghan Yu,Song Liu,Eugene J.Yu,Zheng Li,Dawei Zhu,Jiangshan Duo,Weimin Xiong,Yifan Song,Guanghua Yu,Jianchen Zhu,Sujian Li
机构: Peking University (北京大学); Tencent (腾讯)
类目: Computation and Language (cs.CL)
备注: 12 pages, 3 figures
Abstract:Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model’s internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present \modelname, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, \modelname attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at this https URL.
[NLP-51] SentGuard: Sentence-Level Streaming Guardrails for Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实时流式生成长篇推理型输出时,如何在生成过程中及时、准确地进行安全干预的问题。现有防护机制存在两大局限:响应级别(response-level)方法延迟至完整输出生成后才进行干预,导致风险暴露时间过长;而令牌级别(token-level)方法则基于不完整的语义信息做出判断,易引发决策不稳定及过度触发防护机制。为此,论文提出SentGuard——一种与生成过程并行运行的句子级别(sentence-level)流式防护机制。其核心创新在于引入轻量级等待缓冲区(waiting buffer),将流式输出的令牌按句子片段分组,并仅在经验证后释放给用户,通过引入微小延迟使系统能够在目标模型继续解码的同时评估当前前缀的潜在风险。为支持该机制,研究构建了StreamSafe基准数据集,涵盖8类危害性内容的结构化逐句标注,用于捕捉安全风险在推理与回复段落中的演化过程。此外,采用从粗到细(coarse-to-fine)的目标函数训练SentGuard,以尽早识别句子边界处的不当意图。实验结果表明,SentGuard在5个安全基准上表现优异,在两句话内检测到90.5%的不安全情形,同时保持7.41%的低流式误报率,显著优于现有基线方法。
链接: https://arxiv.org/abs/2606.02041
作者: Jiaqi Yu,Xin Wang,Yixu Wang,Jie Li,Yan Teng,Xingjun Ma,Yingchun Wang
机构: Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: 16 pages, 5 figures, submitted to ARR
Abstract:Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas token-level methods act on incomplete semantics, often producing unstable decisions and excessive guard invocations. To address this challenge, we propose SentGuard, a sentence-level streaming guardrail that operates in parallel with generation. A lightweight waiting buffer groups streamed tokens into sentence chunks and releases only verified chunks to the user, introducing a small offset that enables SentGuard to assess the current prefix while the target LLM decodes subsequent content. To support this, we construct StreamSafe, a benchmark with structured per-sentence annotations across 8 harm categories, capturing the evolution of safety risks across both reasoning and response segments. We further train SentGuard with a coarse-to-fine objective to detect unsafe intent as soon as it emerges at sentence boundaries. Experiments on 5 safety benchmarks show that SentGuard outperforms existing baselines, detecting 90.5% of unsafe cases within two sentences while maintaining a low streaming false-positive rate of 7.41%.
[NLP-52] OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents
【速读】: 该论文旨在解决开放环境下视觉网页智能体(visual web agents)训练中长期推理能力不足、精准定位困难以及与动态真实网站交互鲁棒性差的核心问题。现有高性能系统多为专有模型,而开源方案则严重依赖大规模人工标注的高质量轨迹进行监督后训练,导致数据收集成本高昂且难以覆盖开放网络的多样性与动态变化。针对这一瓶颈,本文提出OpenWebRL——一个面向真实网站的在线多轮强化学习(online multi-turn RL)开源训练框架,其关键创新在于构建了完整的端到端训练体系:包括可扩展的实时浏览器基础设施、基于少量轨迹的监督初始化、多模态上下文管理、基于轨迹级别的成功判定机制以及高效的多轮策略优化方法。通过该框架训练的OpenWebRL-4B模型仅需0.4K初始化轨迹和2.2K开放式强化学习任务,便在Online-Mind2Web和DeepShop等挑战性基准上分别达到67.0%和64.0%的成功率,显著超越同规模或更大体量的开源模型,并媲美闭源系统如OpenAI CUA和Gemini CUA。研究进一步系统分析了在线强化学习在视觉网页智能体中的有效设计要素,揭示了强化学习如何提升智能体的自主推理能力。本工作为构建更强大、可复现且成本更低的开放网页智能体提供了可行路径,并将公开全部训练数据、模型与代码以推动后续研究。
链接: https://arxiv.org/abs/2606.02031
作者: Rui Yang,Qianhui Wu,Yuxi Chen,Hao Bai,Wenlin Yao,Hao Cheng,Baolin Peng,Huan Zhang,Tong Zhang,Jianfeng Gao
机构: UIUC(伊利诺伊大学厄本那-香槟分校); Microsoft(微软)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: 36 pages, 11 figures
Abstract:Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.
[NLP-53] Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning ICML2026
【速读】: 该论文旨在解决大模型在链式思维(Chain-of-Thought, CoT)推理过程中存在的计算效率低下与推理可靠性不足的问题。其核心挑战在于:尽管模型在推理过程中逐渐收敛至正确答案,但往往持续生成大量冗余且无意义的中间步骤,导致不必要的计算开销;同时,缺乏有效的机制来识别推理过程中的稳定可靠阶段,难以实现高效终止或动态优化。解决方案的关键在于揭示了CoT推理中普遍存在的“双阶段”熵动态结构——即从探索性不确定区(Uncertainty Region)向收敛性高置信区(Confidence Region)的突变过渡,并发现该高置信区具备两个关键性质:高可靠性(High Reliability),即答案趋于准确且稳定;高冗余性(High Redundancy),即模型在已得正确答案后仍持续生成无关文本。基于此,论文提出两种新型推理优化策略:1)早期退出(Early Exit)利用可靠性与冗余性信号,在输出收益下降时安全终止计算;2)测试时缩放(Test-Time Scaling)通过识别收敛轨迹优先选择高质量推理路径。为实现上述策略,论文首次将置信区检测建模为序列变化点检测(sequential change-point detection)问题,并采用统计最优的累积和(CUSUM)算法构建无需训练的实时推理控制框架。实验表明,该方法在早期退出任务中显著优于现有基准,达到63.06%准确率的同时减少11.1%的生成令牌,相较DEER和Dynasor分别提升3.28%和4.36%的准确率;在测试时缩放任务中,基于CUSUM加权的投票机制亦持续优于自一致性(self-consistency)方法。
链接: https://arxiv.org/abs/2606.02020
作者: Ting Xu,Xu He,Yupu Lu,Jiankai Sun,Dong Li,Wai Lam,Jianye Hao
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 21 pages, 10 figures, accepted in ICML2026
Abstract:This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability – answers in the confidence region become highly accurate and stable, and 2) High Redundancy – models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2)Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.
[NLP-54] PlanarBench: Evaluating LLM Spatial Reasoning via Planar Graph Drawing
【速读】: 该论文旨在解决大语言模型(LLM)在缺乏显式空间结构信息的情况下,对平面图(planar graph)进行可视化生成的挑战,即仅通过边列表(edge list)生成对应的ASCII艺术图,这是一项依赖于空间推理能力的任务。传统图基准测试通常仅以节点数量作为难度指标,而本文提出一个更精细的评估框架——PlanarBench,其核心创新在于揭示边数(edge count)是决定任务难度的主导因素(相关系数 r = -0.85),显著优于以往仅依赖节点数的评估方式。解决方案的关键在于构建了一个包含199个最简非同构连通平面图(2–7个顶点)的基准数据集,并设计了具有高度可变性与不可预测性的测试场景,其中边顺序、边方向及节点标签均可任意置换,从而有效抑制模型对训练样本的过拟合或记忆行为,迫使模型真正具备基于抽象拓扑关系进行空间重构的能力。
链接: https://arxiv.org/abs/2606.02010
作者: Oleksandr Nikitin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, this https URL
Abstract:PlanarBench tests whether LLMs can draw planar graphs as ASCII art given only an edge list – a spatial reasoning task that resists memorization because edge order, edge orientation, and node labels are all permutable. We evaluate 91 models on the 199 simplest non-isomorphic connected planar graphs (2 - 7 vertices). Edge count is the dominant difficulty predictor ( r = -0.85 ) – a finding not reported in prior LLM graph benchmarks, which use only node count as the difficulty axis. Comments: 12 pages, 4 figures, this https URL Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.02010 [cs.CL] (or arXiv:2606.02010v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.02010 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-55] Automated Essay Scoring and Language Certification: Assessing Generalizability Agreement and Validity for French
【速读】: 该论文旨在解决自动作文评分(AES)领域中存在的评价实践过于简化的问题,其核心矛盾在于当前基准测试方法倾向于采用单一指标评估模型性能,与更全面的评估框架(如基于论证的验证框架,ABV)所倡导的多维度评估理念相悖,尤其在高风险语言测试场景下更为突出。为应对这一问题,论文提出了一种改进且更具实践性的ABV框架,其关键创新在于整合了公平性分析、与语言特征的相关性分析、预测误差评估以及模型评分结果与人工评分者的一致性比较等多维评估维度。通过在法语AES任务上对8种模型架构进行系统评估,利用包含2.7万篇考试作文(每篇由2名评分者打分)和961篇泛化样本(至少9名评分者打分)的数据集,实证表明该增强版框架不仅有助于深入理解现有AES模型的优势与局限,还推动了法语自动作文评分技术的最新进展。
链接: https://arxiv.org/abs/2606.02009
作者: Rodrigo Wilkens,Rémi Cardon,Vincent Folny,Thomas François
机构: University of Exeter; France Éducation international; Cental, ILC, UCLouvain; Computer Science and Engineering Department, Universidad Carlos III de Madrid
类目: Computation and Language (cs.CL)
备注:
Abstract:In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework, incorporating fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare 8 model architectures on a corpus of 27k exam essays (2 raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state-of-the-art for French AES.
[NLP-56] Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling
【速读】: 该论文旨在解决消费价格测量中利用替代数据源(如扫描数据、网络爬取数据及交易/收据数据)时面临的商品描述短小、噪声大、缩写严重且无标准产品编码的问题,核心挑战在于如何将这些非标准化的商品条目准确映射至统一的消费分类体系(如联合国COICOP分类)。其解决方案的关键在于构建一个通用、可复现的自动化映射流水线:首先对噪声文本进行规范化与分词;其次采用基于前缀树(trie)的规则预分类器,通过各品类特有的关键词与停用词驱动分类;最后引入针对每个品类的二元确认模型,判断候选分类的合理性。为实现大规模标注,设计了人机协同标注协议,通过动态更新的可靠性权重聚合人工标注结果,并结合规则系统实现持续优化。实证研究表明,该任务在理想条件下已趋于饱和——基于词袋模型的线性分类器即可达到约0.99的F1分数,高阶特征(如n-gram)无增益,仅需约67个标注样本即达性能上限;蒙特卡洛模拟显示,加权投票策略虽略优于简单多数投票,但Dawid-Skene模型在标签校正方面表现显著更优。研究还为统计机构在引入交易数据时提供了价格水平质量控制与系统设计的重要启示。
链接: https://arxiv.org/abs/2606.02004
作者: Vladimir Beskorovainyi
机构: Besk Tech(贝斯科技); Moscow Institute of Physics and Technology (莫斯科物理技术研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 11 pages, 3 tables. Methodology paper; illustrative experiments only, no proprietary data
Abstract:Consumer-price measurement increasingly draws on alternative data sources – scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a consumption classification (e.g., the UN COICOP scheme) before prices can be compared. This paper studies that mapping as a general, reproducible method. The pipeline is: (i) text normalization and tokenization of noisy item names; (ii) a prefix-tree (trie) rule-based pre-classifier driven by per-category key-phrases and stop-phrases; and (iii) a per-category binary confirmation model deciding whether an item belongs to a tentatively assigned category. For labels at scale we use a human-in-the-loop protocol in which annotators give a binary valid/reject judgment, aggregated by a dynamically updated reliability weight; the model joins the same rule, enabling continual fine-tuning. Our empirical finding is deflationary: in a controlled, leakage-free study (one category, real positives vs. hard negatives, five seeds), bag-of-words models essentially saturate the task (F1 about 0.99) – a linear classifier matches a multilayer perceptron, explicit word-order (n-gram) features add nothing, and about 67 labeled examples already suffice. A Monte-Carlo study of the labeling protocol shows the reliability-weighted vote barely beats plain majority (its additive weights saturate) while Dawid-Skene recovers labels markedly better. We also discuss price-level quality control and design lessons for statistical offices considering transaction data. All figures are illustrative; no confidential data, code, or documentation is reproduced.
[NLP-57] Scaling Agent ic Capabilities via Grounded Interaction Synthesis
【速读】: 该论文旨在解决当前生成式智能体(agentic intelligence)在构建复杂任务与多样化环境时,因依赖大语言模型(LLM)自动生成交互数据而产生的偏差与低效问题。现有方法完全依赖LLM进行无约束生成,导致合成数据往往局限于模型内部先验,难以体现真实世界任务的多样性与复杂性,尤其在长时程、高难度任务的建模上表现不足。其解决方案的关键在于提出一种基于双重接地机制的自动化框架——基于真实世界模型上下文协议(Model Context Protocol, MCP)服务器构建协议锚定环境,以确保环境功能的多样性和实际挑战性;并采用结构引导规划策略,通过主动引入逻辑依赖关系和对抗性策略,驱动智能体在环境中完成复杂、高保真任务的生成。实验表明,GAIS生成的数据在BFCL、τ²-Bench和ACEBench等基准上显著优于现有最先进基线,使基础模型性能达到甚至超越官方指令微调版本,且展现出更强的数据效率与可扩展性,在数据量远低于基线的情况下仍能持续提升能力,而后者则趋于停滞。
链接: https://arxiv.org/abs/2606.02001
作者: Wenhang Shi,Jinhao Dong,Yiren Chen,Zhe Zhao,Shuqing Bian,Wei Lu,Xiaoyong Du
机构: Renmin University of China(中国人民大学); Peking University(北京大学); Tencent(腾讯)
类目: Computation and Language (cs.CL)
备注:
Abstract:General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Models (LLMs) to scale the synthesis of agentic environments and tasks. However, such unconstrained generation often degenerates into biased random sampling of LLMs’ internal priors, failing to capture the diversity and difficulty of real-world domains or construct high-fidelity, long-horizon tasks. In this work, we introduce Grounded Agentic Interaction Synthesis (GAIS), a framework that automates the scalable construction of diverse environments and complex tasks via a two-phase grounding mechanism. Specifically, we construct protocol-anchored environments derived from real-world Model Context Protocol (MCP) servers to ensure functional diversity and difficulty. Subsequently, we employ structure-guided planning to navigate these environments, actively enforcing logical dependencies and adversarial policies to generate complex tasks. Experiments on BFCL, \tau^2 -Bench, and ACEBench demonstrate that GAIS-synthesized data significantly outperforms state-of-the-art baselines, enabling base models to match or even surpass their official instruction-tuned counterparts. Furthermore, GAIS exhibits superior data efficiency and scalability, achieving exceptional capabilities with significantly less data while maintaining continuous growth where baselines stagnate. Our code and dataset are publicly available at this https URL.
[NLP-58] CARTE: A Benchmark for Mapping Language Model Knowledge Across France
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理法国境内地理与区域差异化知识时的细粒度推理能力不足的问题。现有基准测试多聚焦于国家层面的文化理解,忽视了同一国家内部不同区域之间的细微差异,尤其是相邻区域间在文化、语言、经济、环境等维度上的复杂区分需求。为弥补这一空白,研究提出CARTE 1(Culturally Anchored Regional-Territorial Evaluation),这是一个包含2,431道题目、覆盖法国13个本土大区及14个主题领域的多项选择型评估基准,涵盖文化、语言、人口、经济、环境与交通等多个方面。此外,研究还构建了子集CARTE-LV,专门针对法语区域间的语言变异进行评估。实验对27个参数规模从1B到12B的LLMs在少样本(few-shot)设置下进行评测,结果揭示了模型在不同地区表现存在显著差异,且模型规模与区域适应性之间存在系统性差距,表明当前预训练数据在地理覆盖上存在偏倚,且模型对国内区域差异的鲁棒性有限。因此,解决方案的关键在于构建一个高分辨率、区域锚定的评估框架,以精准衡量和推动模型在跨区域知识推理方面的改进。
链接: https://arxiv.org/abs/2606.01995
作者: Sarah Almeida Carneiro(X),Christos Xypolopoulos(X, NTUA),Xiao Fei(X),Yang Zhang(X),Michalis Vazirgiannis(X, MBZUAI)
机构: École Polytechnique, Institut Polytechnique de Paris(法国巴黎综合理工学院); National Technical University of Athens(雅典国立技术大学); Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce CARTE 1 (Culturally Anchored Regional-Territorial Evaluation), a multiplechoice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmarks focus on national-level cultural understanding, they largely overlook intra-country variation and the need to distinguish between closely related regional contexts. CARTE addresses this gap by introducing 2,431 questions spanning the 13 metropolitan regions of France and covering 14 thematic domains, including culture, language, demographics, economy, environment, and mobility. We further introduce CARTE-LV, a subset targeting Linguistic Variation across French regions, enabling focused evaluation of language-related differences. We evaluate 27 LLMs ranging from 1B to 12B parameters under few-shot settings. Our experiments reveal performance disparities across regions and model scales, suggesting systematic gaps in pretraining coverage and limited robustness to intra-national variation.
[NLP-59] MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?
【速读】: 该论文旨在解决从网络上获取的、以人类为导向的程序性知识(procedural knowledge)难以直接被智能体(agent)使用的问题。这类知识通常具有多模态、异构性、噪声干扰以及隐含人类执行者等特性,与代理所需的形式化可执行技能之间存在显著鸿沟。为弥合这一差距,论文提出“指南到技能学习”(guide-to-skill learning)问题,其核心在于将现实世界中的非结构化指南转化为可编辑、可执行的技能,并基于智能体在任务执行过程中可观测的轨迹反馈持续优化这些技能。解决方案的关键在于构建一个闭环框架——MMG2Skill,该框架通过结构化地编译指南生成可编辑技能,利用固定视觉-语言模型(VLM)代理在执行中依赖这些技能,并基于轨迹层面的根因反馈进行技能修订,而无需依赖基准分数。实验表明,在六种不同VLM骨干模型上,该方法在GUI控制、开放式游戏和策略卡牌游戏中均显著优于基线代理,平均性能提升达+12.8至+25.3个百分点。消融实验进一步验证了仅直接对原始指南进行提示会损害性能,而结构化的技能构建与基于轨迹的迭代修正共同构成了性能提升的核心要素;在成功信号可推断的任务中,基于分析器的早期终止机制还能有效防止后期性能退化,并在成功信号校准得当时减少25%–53%的无效尝试。
链接: https://arxiv.org/abs/2606.01993
作者: Xinyu Che,Junqi Xiong,Yunfei Ge,Xinping Lei,Shihao Li,Hang Yan,Han Li,Yuanxing Zhang,Zhiqi Bai,Jinhua Hao,Ming Sun,Han Li,Jiaheng Liu
机构: Nanjing University; Kuaishou Technology
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 35 pages, 12 figures, 13 tables. Code: this https URL
Abstract:Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.
[NLP-60] SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning ACL2026
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)代理在复杂环境中通过模型上下文协议(Model Context Protocol, MCP)扩展动作空间所引发的安全风险,尤其是因能力过度扩张而产生的权力寻求行为。其核心问题是:尽管更广阔的动作空间有助于任务完成,但同时也放大了微小错误或幻觉导致灾难性失败的可能性,形成脆弱的风险面。解决方案的关键在于提出SafeMCP——一种基于服务器端的防御插件,通过预测未来安全风险的前瞻性推理机制,对工具获取进行约束。SafeMCP利用内部世界模型实现前瞻式推理,构建双层防御机制:第一层为前瞻性工具过滤,预防潜在危险的能力扩张;第二层为即时干预,作为失效保护措施。为训练SafeMCP,研究设计了三阶段流程:环境动态建模、安全策略初始化以及基于双重可验证奖励的强化学习(Reinforcement Learning, RL)。实验在PowerSeeking Bench、ToolEmu和AgentHarm基准上验证了SafeMCP能够在有效降低风险的同时保持代理的任务效能,实现安全与可用性的平衡。
链接: https://arxiv.org/abs/2606.01991
作者: Lichao Wang,Zhaoxing Ren,Tianzhuo Yang,Jiaming Ji,Chi Harold Liu,Yaodong Yang,Juntao Dai
机构: Beijing Institute of Technology (北京理工大学); Beijing Academy of Artificial Intelligence (北京人工智能研究院); Institute for Artificial Intelligence, Peking University (北京大学人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), Main Conference
Abstract:As Large Language Model (LLM) agents increasingly leverage the Model Context Protocol (MCP) to operate in complex environments, the expansion of their action spaces offers agents unsafe capabilities and underscores the risk of power-seeking. While broad action space and greater environment influence are essential for task fulfillment, they create a fragile risk surface where minor errors or hallucinations are magnified into catastrophic failures. In response, we propose SafeMCP, a server-side defense plugin that constrains tool acquisition via predictive reasoning regarding future safety risks. SafeMCP utilizes an internal world model for look-ahead reasoning to implement a two-tier defense: proactive tool filtering to constrain hazardous power expansion and immediate intervention as a fail-safe. To train SafeMCP, we introduce a three-stage pipeline comprising environmental dynamic grounding, safe policy initialization, and reinforcement learning (RL) with dual verifiable rewards. Experiments on PowerSeeking Bench, ToolEmu, and AgentHarm show that SafeMCP achieves a safe equilibrium, effectively mitigating risks while preserving agent utility.
[NLP-61] raining Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在微调阶段中训练提示(training prompts)设计不当所导致的灾难性遗忘与泛化能力不足问题。现有微调范式通常将训练提示视为表面形式,假设语义等价的指令会带来相同的模型学习效果,但研究发现这一假设具有误导性:尽管改写后的提示在任务内表现相近,却在跨任务场景下引发显著不同的遗忘与泛化行为,且这些影响在不同任务间呈正相关,表明存在一类能持续提升性能的“优越提示”。其关键突破在于发现这些优越提示可通过学习前的任务损失(task loss prior to learning)进行稳健识别。基于此,论文提出轻量级但高效的状态自适应提示优化(State-Adaptive Prompt Optimization, SAPO)策略,将任务表述从静态输入转变为动态、状态自适应变量,从而在训练过程中主动优化提示以抑制遗忘并增强泛化能力。大量实验证明,SAPO显著优于当前最优方法,在多个基准上实现显著性能提升,为理解训练提示如何塑造学习动态提供了新见解,并提供了一套可落地的鲁棒微调方案。
链接: https://arxiv.org/abs/2606.01967
作者: Wenhang Shi,Yiren Chen,Shuqing Bian,Zhe Zhao,Jinhao Dong,Pengfei Hu,Wei Lu,Xiaoyong Du
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that semantically equivalent instructions yield identical learning outcomes. However, we reveal that this equivalence is deceptive: while paraphrased prompts often lead to comparable in-task performance, they induce drastically different cross-task impacts regarding catastrophic forgetting and generalization. Crucially, these impacts are positively correlated across tasks, indicating the existence of superior prompts that consistently yield better performance. Furthermore, we discover that these superior prompts can be robustly identified by task loss prior to learning. Leveraging these insights, we introduce State-Adaptive Prompt Optimization (SAPO), a lightweight yet effective training strategy that shifts task formulation from a static input to a dynamic, state-adaptive variable. Comprehensive experiments on diverse benchmarks confirm its effectiveness, which significantly mitigates forgetting while improving generalization, achieving substantial performance gains over state-of-the-art methods. These results provide insights into how training prompts shape learning dynamics and offer a practical recipe for robust fine-tuning. Our code is available at this https URL.
[NLP-62] Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location Within-Word Landing Position and Fixation Duration in Reading
【速读】: 该论文旨在解决眼动追踪阅读数据(eye-tracking-while-reading data)在自然语言处理与认知科学研究中因数据稀缺而制约模型发展的问题。其核心挑战在于高质量眼动数据的采集成本高、耗时长,难以支撑大规模数据驱动模型的训练需求。为此,作者提出Eyettention II——一种端到端训练的轻量级深度学习模型,能够生成包含固定点位置、词内落点及注视持续时间等完整属性的逼真扫描路径(scanpath)。该模型的关键创新在于其高效性与认知合理性:在有限GPU资源下即可快速训练,并且在生成的扫描路径中准确捕捉关键心理语言学现象,如回视(regression)、词内加工模式等,从而模拟人类真实的注视行为。实验表明,Eyettention II在扫描路径预测性能上超越现有最优模型,为自然语言处理、心理语言学实验材料预研以及认知机制探索提供了可扩展、可复现的数据生成工具。
链接: https://arxiv.org/abs/2606.01964
作者: Shuwen Deng,Cui Ding,David R. Reich,Paul Prasse,Lena A. Jäger
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The way our eyes move while reading provides valuable insights into both the reader’s cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has shown to be highly beneficial in various technological applications, such as enhancing and interpreting language models and inferring a reader’s characteristics. However, these applications often rely on large-scale, data-driven models, which demand extensive eye-tracking datasets that are challenging to obtain due to the resource-intensive nature of data collection. To address the challenge of data scarcity, we develop Eyettention II, an end-to-end trained deep-learning model capable of generating realistic scanpaths consisting of a complete set of fixation attributes in chronological order, including fixation location, within-word landing position, and fixation duration. Our model is lightweight, efficiently trainable on limited GPU resources, and closely aligned with cognitive theories. We demonstrate that Eyettention II surpasses state-of-the-art models in scanpath prediction and mirrors human-like gaze behavior by capturing key psycholinguistic phenomena. With its robust performance, Eyettention II holds the potential to drive advancements in natural language processing, facilitate piloting the materials of psycholinguistic experiments, and uncover new insights beyond what is explicitly encoded in theoretical cognitive models.
[NLP-63] What to Format and How: A Benchmark and Workflow Approach for Document Formatting
【速读】: 该论文旨在解决生成式 AI 在真实场景下进行内容感知型文档格式化时面临的挑战,即如何在不依赖人工干预的前提下,准确识别文档中需格式化的语义目标并执行相应修改。现有方法普遍存在冗余读取文档内容的问题,导致效率低下且易产生错误。其解决方案的关键在于提出 DocFormFlow,一种将目标定位(what to format)与格式化操作执行(how to format)解耦的流程化方法,从而实现更高效、精准的格式化。同时,为支持真实内容感知场景下的评估,研究引入 DocFormBench 基准测试体系,涵盖多样化格式需求及兼顾准确性与效率的评价指标。实验表明,DocFormFlow 在多种大语言模型(LLM)与多模态模型上均显著提升格式化准确率并降低令牌(token)消耗,且分析揭示精确的目标定位是影响格式化性能的核心因素。
链接: https://arxiv.org/abs/2606.01936
作者: Shihao Rao,Liang Li,Jiapeng Liu,Tong Lin,Bing Li,Xiyan Gao,Peng Fu,Jing Huang,Can Ma
机构: Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所); School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation this http URL enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and this http URL mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization from modification execution into what to format and how. Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.
[NLP-64] HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression
【速读】: 该论文旨在解决大语言模型在采用扩展式思维链(Chain-of-Thought, CoT)推理时产生的显著推理开销问题。现有CoT压缩方法普遍存在手动长度预算设置僵化、多阶段训练流程计算成本高以及仅适用于小规模模型的可扩展性瓶颈等缺陷。其解决方案的关键在于提出一种高效、单阶段的强化学习框架——混合中位数策略优化(Hybrid Median-length Policy Optimization, HMPO),通过三个协同作用的组件实现高效压缩:基于成功回放轨迹自适应生成的中位数预算,避免了人工调参;采用余弦衰减的令牌奖励机制,实现平滑的长度惩罚;以及乘法形式的奖励函数,有效抑制冗余或虚假奖励劫持行为,严格保障答案正确性优先。该方法仅在数学数据上训练,却能无缝泛化至数学、代码、科学及指令遵循等多种任务。大规模实验验证了其在从9B到122B参数量级、涵盖密集型与混合专家(Mixture-of-Experts, MoE)架构下的有效性,实现19%–46%的令牌压缩率,同时保持几乎无损的准确性,并显著降低训练成本,相较现有方法具有更强的实用性与可扩展性。
链接: https://arxiv.org/abs/2606.01934
作者: Minghui Zheng,Hongxu Chen,Huimin Ren,Hongsheng Xin,Xiaoyang Qu,Ze Wang,Shuling Yang,Ziyu Peng,Kaike Zhang,Pan Zhou,Kun Zhan
机构: Li Auto Inc. (小鹏汽车)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%–46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
[NLP-65] Mitigating Bias in Locally Constrained Decoding via Tractable Proposals
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成内容时难以满足特定约束条件(如JSON schema)的问题,尤其是现有局部约束解码(Locally Constrained Decoding, LCD)方法因盲目屏蔽后续词元导致采样偏差及性能下降的缺陷。其解决方案的关键在于提出一种通用框架,用于构建序列蒙特卡洛(Sequential Monte Carlo, SMC)采样中的有效提议分布(proposal)与势函数(potential),以实现更高效、无偏的约束生成。具体而言,作者首先将由有限状态自动机(Finite Automata)表示的约束条件进行张量化(tensorization),从而在GPU上高效执行,并据此构建全局约束解码(Globally Constrained Decoding, GCD)提议;进一步利用张量化有限自动机与隐马尔可夫模型(Hidden Markov Model, HMM)共享相同电路结构的特性,通过电路乘法(circuit-multiplication)融合逻辑与概率信息,得到概率性全局约束解码(Probabilistic GCD, P-GCD)提议,能够同时编码目标分布的逻辑结构与概率特性。实验表明,在函数调用、关键词生成和SQL生成等任务中,相较于传统LCD提议,(P-)GCD在相同SMC设置下能以更少粒子数更快收敛至目标分布,显著提升采样效率与生成质量。
链接: https://arxiv.org/abs/2606.01926
作者: Meihua Dang,Linxin Song,Honghua Zhang,Jieyu Zhao,Guy Van den Broeck,Stefano Ermon
机构: Stanford University (斯坦福大学); Google(谷歌); UCLA (加州大学洛杉矶分校)
类目: Computation and Language (cs.CL)
备注: 13 pages, 5 figures
Abstract:Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from p_\mathrmlm( \cdot \mid \mathrmconstraint) . First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.
[NLP-66] Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对与内部参数化记忆冲突的输入证据时出现的“上下文忽视”问题,即模型倾向于生成与外部输入事实相悖的虚假内容(事实性幻觉)。现有缓解方法多依赖于抑制特定神经元激活或采用计算开销较大的对比解码机制,往往导致困惑度上升或推理延迟显著增加。本文提出一种轻量级的推理阶段干预方法——共振上下文锚定(Resonant Context Anchoring, RCA),其核心在于从残差流信号动态的角度出发,解决外部证据在深层网络传播过程中的信号衰减问题。RCA的关键机制是将自注意力模块中的路由逻辑与信息幅度进行正交解耦:利用原始的预软最大值注意力分数作为语义对齐的瞬时度量,通过非线性整流构建动态增益场,选择性增强对应上下文标记的值向量范数,而无需改变注意力概率分布。这一设计有效提升了残差流混合中输入证据的信噪比(Signal-to-Noise Ratio, SNR),从而在推理过程中稳健地将生成轨迹锚定于真实上下文。大量实验表明,RCA在Llama-3系列模型上显著提升了多种事实一致性及强知识冲突任务中的上下文忠实性,有效抑制了参数化幻觉;同时验证了其作为无需训练、计算开销可忽略的即插即用模块,在保持模型通用语言理解能力的前提下,实现了忠实性与流畅性的帕累托改进。
链接: https://arxiv.org/abs/2606.01923
作者: Mingkuan Zhao,Yide Gao,Wentao Hu,Suquan Chen,Tianchen Huang,Zhenhua An,Zetao Chang,Xiayu Sun,Yuheng Min
机构: Xi’an Jiaotong University (西安交通大学); University of Science and Technology of China (中国科学技术大学); Tongji University (同济大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) frequently exhibit “contextual disregard” when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model’s general language understanding capabilities.
[NLP-67] Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在空间类选择题上的可靠性问题,特别是针对其在处理空间关系判断时表现出的“空间词汇偏差”(spatial lexical bias)现象。研究发现,当答案选项中引入与空间关系相关的词语时,模型倾向于被这些词汇误导,从而选择错误选项,即使其在二选一情境下能正确作答。这种“二元稳定但三元脆弱”的失败模式揭示了模型决策机制中的关键缺陷:尽管视觉信息中的正确空间关系仍被有效感知,但偏差主要源于语言模型内部对特定词汇的过度敏感。通过机制可解释性工具分析,研究识别出该偏差源自语言模型侧特定的通道与神经元。基于此发现,作者提出一种轻量级仅更新语言模型的直接偏好优化(DPO)方法,在极小规模的单对象对合成数据上进行微调,显著缓解了该偏差,使模型在合成数据上的四分类鲁棒准确率提升达100点,并在更广泛的评估数据集WhatsUp、SpatialMQA-Direct和VSR上分别提升了68.0、32.6和20.1点,验证了该解决方案的有效性与泛化能力。
链接: https://arxiv.org/abs/2606.01914
作者: Chuang Ma,Qianying Liu,Tomoyuki Obuchi,Fei Cheng,Wang Yang,Sudong Cai,Shuyuan Zheng,Akiko Aizawa,Sadao Kurohashi
机构: Kyoto University (京都大学); NII LLMC (日本国立情报学研究所语言模型研究中心); RIKEN AIP (理化学研究所先进智能研究中心); Case Western Reserve University (凯斯西储大学); The Hong Kong Polytechnic University (香港理工大学); The University of Osaka (大阪大学); University of Tokyo (东京大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model’s decision and make the newly added option likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure instead originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show the correct spatial relation remains internally available on these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.
[NLP-68] KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)在医疗领域应用中,通用语言模型对临床语言复杂性适应不足的问题。针对挪威语临床文本的特定需求,研究提出KliniskVestBERT,一套基于BERT架构的三种编码器模型,其在由Helse Vest提供的大规模真实世界、去标识化挪威语临床文本上进行预训练。该研究不仅对现有挪威语模型Nb-BERT-large、NorBERT3-large和ModernBERT进行了继续预训练,还通过精心筛选涵盖出院小结、手术报告、护理记录等多类文档的代表性临床语料,确保了对挪威医疗环境中语言特征的全面覆盖。实验结果表明,在三个合成挪威语临床基准数据集及两个真实世界任务上的评估中,所提出的临床专用模型均显著优于基线模型,验证了领域特定预训练在临床NLP任务中的关键价值。解决方案的核心在于利用高质量、大规模且具有代表性的本地化临床语料进行深度领域适配,从而显著提升模型对临床文本的理解与泛化能力。
链接: https://arxiv.org/abs/2606.01904
作者: Christian Autenried,Cosimo Persia
机构: Helse Vest ICT (Helse Vest 信息与通信技术); Helse Bergen (赫尔塞贝根); Helse Fonna (赫尔塞福纳); Helse Førde (赫尔塞福尔德); Helse Stavanger (赫尔塞斯塔万格); DIPS (DIPS)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing application of Natural Language Processing (NLP) in healthcare demands language models specifically attuned to the complexities of clinical language. This work introduces KliniskVestBERT, a suite of three BERT-based encoder models pre-trained on a substantial corpus of real-world, de-identified Norwegian clinical texts from Helse Vest. We continue pretraining existing language models Nb-BERT-large, NorBERT3-large, and ModernBERT on our specialized clinical dataset. This dataset is based on a representative population of Helse Vest patients. The included document types are carefully curated to encompass a broad clinical spectrum in bokmål and nynorsk including discharge summaries, surgical reports, nursing notes etc. ensuring comprehensive representation of the linguistic landscape within Norwegian healthcare settings. Evaluation on three synthtetic Norwegian clinical benchmark datasets and two real-world problems demonstrates that each of our clinically specialized models consistently outperforms their baseline counterparts, highlighting the significant benefit of domain-specific pre-training for NLP tasks within the clinical domain. The project was a joint effort by all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde and Helse Stavanger) with DIPS under the project lead of Helse Vest ICT.
[NLP-69] he Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
【速读】: 该论文旨在解决生成式图像重建过程中视觉-语言模型与图像生成器之间协同优化的评估难题,核心问题是如何在多轮交互中有效衡量和提升图像重建质量。其解决方案的关键在于提出“图像重建游戏”(Image Reconstruction Game)这一全自动基准测试框架:通过让视觉-语言模型在多轮交互中向图像生成器发出修正指令,将双方逐步建立的共同理解以生成图像的形式直接可视化,从而实现对交互过程与最终结果的可度量评估。研究发现,描述模型(Describer)是决定重建质量的主要因素,而生成模型(Generator)则决定了迭代优化是否有效;数学与几何类图像最具挑战性;描述模型的词元预算(token budget)显著影响收敛行为——较短预算产生稀疏初始图像,为后续改进留出空间,而较长预算虽提升初始质量但减少可修正余地;表现更优的描述模型使用涵盖空间、数值与结构等多维度的丰富修正词汇,而较弱模型仅聚焦表面属性且过早终止交互。此外,人工验证表明,当前最佳自动化评判指标与人类偏好仅达轻微至中等一致性,提示自动化评分需经人工校准方可可靠应用。
链接: https://arxiv.org/abs/2606.01901
作者: Sherzod Hakimov,Mattia D’Agostini,Ivan Samodelkin,David Schlangen
机构: University of Potsdam, Germany; German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer’s token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.
[NLP-70] CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在文化智能(Cultural Intelligence)评估中过度聚焦知识储备而忽视其在真实情境下有效运用已知文化知识的能力这一核心问题。现有研究将文化智能简化为知识层面的问答任务,未能充分检验模型在复杂社会语境中的推理能力。为此,作者提出CultureForest——一个基于文化规范(Cultural Norm)的基准测试框架,其每个问题均以一组原子级文化规范为根基,支持可验证、可归因的评估。该基准包含5,378个样本,覆盖8个领域与53个不同国家/地区,并支持从选择题到开放式生成的渐进式评估范式。实验表明,即使顶尖模型在开放式生成任务中也出现显著性能下降,且存在明显的跨区域差异。深入分析揭示四大关键规律:(1)推理阶段的即时推理增益有限,甚至可能加剧不平等;(2)模型表现出高度一致的区域偏好结构;(3)在严格文化约束下,模型响应呈现显著保守性;(4)通过解耦文化知识获取与文化推理过程,发现尽管模型具备丰富的文化知识,但其表现瓶颈在于如何有效应用这些知识。上述发现共同指向一个必要转变:从以知识为中心的评估范式转向衡量“知识驱动的文化推理”能力。
链接: https://arxiv.org/abs/2606.01879
作者: Yangfan Ye,Xiaocheng Feng,Jialong Tang,Xiayu Cao,Zihan Zhang,Xiachong Feng,Baosong Yang,Bing Qin
机构: Harbin Institute of Technology (哈尔滨工业大学); The University of Hong Kong (香港大学); Harvard University (哈佛大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for \textitCultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.
[NLP-71] ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在差分隐私(Differentially Private, DP)条件下合成文本数据时,其是否能真正传递原始敏感语料库中独有的新知识与能力这一关键问题。现有评估方法多依赖于无需训练即可近似解决的任务,导致高性能表现无法证明DP合成数据可替代原始数据访问,因而存在评价偏差。为此,论文提出ContinuousBench——一个持续自动更新的基准测试框架,用于衡量DP合成文本带来的实际能力提升。该基准每季度发布一次,包含全新的训练语料库及其衍生的问答集,确保任务在无原始语料情况下不可解,同时在差分隐私保护下仍可通过数百条独立记录学习。研究者需基于训练语料生成DP合成数据,并在标准化训练与评估流程中测度性能增益。该框架设立两个赛道:Geminon(虚构生物的程序化生成数据集)和News(实时爬取的公共新闻流)。实验表明,尽管传统基准已趋于饱和,但在ContinuousBench上,非私有合成数据能有效迁移原语料中的知识,而当前最先进的DP合成方法即便在ε=100的宽松隐私预算下,也普遍无法实现类似的知识传递,揭示了现有DP文本合成技术在保留语义丰富性方面的根本局限。
链接: https://arxiv.org/abs/2606.01849
作者: Peihan Liu,Lucas Rosenblatt,Weiwei Kong,Natalia Ponomareva,Gautam Kamath,Rachel Cummings,Roxana Geambasu,Yu Gan,Lillian Tsai,Alex Bie
机构: Columbia University (哥伦比亚大学); NYU (纽约大学); Google Research (谷歌研究); University of Waterloo (Waterloo大学); Vector Institute (向量研究所)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Datasets: this https URL ; Eval Harness: this https URL ; Blog post: this https URL
Abstract:Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly solvable without training, so strong benchmark performance does not establish that DP synthesis can substitute original data access. Thus, we introduce ContinuousBench, a continuously and automatically-regenerated benchmark that measures capability gain from DP synthetic text. Each quarter, a new release pairs a never-before-seen training corpus with a derived QA set, constructed to be: (1) unsolvable sans-corpus; and (2) learnable under DP, as the tested knowledge is supported by hundreds of independent records. Researchers produce DP synthetic data from the training corpus and run our standardized training and evaluation harness on their synthetic data to measure gains. We instantiate two tracks: Geminon, a procedurally-generated dataset about fictional creatures; and News, a stream of newly crawled public news articles. Although standard benchmarks are nearly saturated, on ContinuousBench we find that non-private synthesis transfers substantial knowledge from the original corpus, while state-of-the-art DP synthesis methods generally fail to do so, even at \varepsilon=100 .
[NLP-72] Unveiling the Limits of Large Language Models in Inferring Prag matic Meaning from Non-Verbal Responses
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在对话中对非言语行为(non-verbal behavior)所传达的间接语用意义(pragmatic meaning)理解能力不足的问题。尽管现有研究主要聚焦于模型对言语行为的语用理解,但非言语行为在人类沟通中具有基础性作用,尤其在孤立使用时可有效传递隐含意图。论文提出并开展首个系统性评估,针对仅由非言语回应构成的对话场景,考察模型识别间接意图的能力。其核心解决方案在于揭示:当前LLMs在处理非言语行为时表现显著下降,准确率相比言语情境最高降低60个百分点;通过深入分析发现,模型对非言语行为的理解存在特定的行为模式,且上下文学习(in-context learning)能有效促进其语用推理能力。因此,提升模型对非言语意图的理解关键在于优化上下文引导机制,并设计更符合非言语语用特征的训练与推理范式。
链接: https://arxiv.org/abs/2606.01845
作者: Sugyeong Eo,Heuiseok Lim
机构: Yonsei University (延世大学); Korea University (高丽大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Although large language models (LLMs) have shown considerable progress in pragmatic language understanding, prior research has focused mainly on their comprehension of verbal behavior. Nonetheless, non-verbal behavior remains a fundamental component of human communication, especially when deliberately utilized in isolation to convey indirect meanings. In this work, we present the first systematic evaluation of LLMs’ ability to infer pragmatic meaning in dialogue consisting solely of non-verbal responses. We explore three research questions: (1) Can LLMs recognize indirect intent conveyed through non-verbal responses? (2) When and how do LLMs fail to capture non-verbal intent? (3) How can we improve LLMs’ ability to interpret non-verbal intent?. Through the evaluation, we observe that LLMs struggle to infer underlying meaning from non-verbal responses, with accuracy dropping by up to 60% points compared to verbal ones. Further extensive analysis reveals a behavioral pattern in LLMs’ interpretations of non-verbal behavior and demonstrates that in-context learning facilitates pragmatic inference.
[NLP-73] LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agent ic Language Models
【速读】: 该论文旨在解决生成式语言模型在智能体(Agentic)工作流中因步骤类型异质性导致的计算资源分配不均问题。具体而言,智能体系统在执行过程中交替进行结构化工具调用(短、确定性、低困惑度)与开放式规划/推理(长、复杂、高困惑度)两种不同性质的步骤,而现有推理系统对所有步骤施加相同的计算开销,造成资源浪费。为此,论文提出一种轻量级适配器 LayerRoute,其核心创新在于实现基于输入类型的动态层跳过机制:通过为 Qwen2.5-0.5B-Instruct 的每个 24 层 Transformer 增加一个轻量级路由模块(每层约 897 参数,采用直通估计器输出硬二值门控)和 LoRA 适配器(秩为 8,共约 1.08M 参数),在保持主干权重冻结的前提下,训练模型识别并跳过特定输入类型下冗余的计算层。实验表明,经过仅 3,000 步(6.4 分钟,A100 40GB)的端到端训练,LayerRoute 实现了 12.91% 的跳过差异率——工具调用步骤可跳过 15.25% 的浮点运算量,而规划步骤仅跳过 2.34%,体现了对不同任务复杂度的自适应优化。同时,得益于 LoRA 适配带来的微调能力,模型在工具调用和规划任务上的困惑度分别降低 1.29 和 1.30,显著提升了生成质量,且仅需 1.10M 可训练参数(占主干参数总量的 0.22%),展现出高效、可扩展的推理优化潜力。
链接: https://arxiv.org/abs/2606.01838
作者: Prateek Kumar Sikdar
机构: Accenture(埃森哲)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 3 figures, 4 tables
Abstract:Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.
[NLP-74] alkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech
【速读】: 该论文旨在解决临床与发育语言研究中细粒度形态句法错误标注(fine-grained morphosyntactic error annotation)面临的劳动密集、依赖专家且难以扩展的问题。其核心解决方案是提出TalkTag,一个基于大语言模型(LLM)的轻量级工具,通过在极低数据资源条件下对儿童叙事语料进行微调,实现了对口语转录文本中CHAT风格错误标注的自动化。该方法的关键在于利用有限标注数据训练出具备泛化能力的模型,在保持较高标注精度的同时,能够有效识别因语言歧义导致的复杂标注场景,从而为低资源环境下的语言分析提供可扩展、实用的自动化支持。
链接: https://arxiv.org/abs/2606.01820
作者: Shamira Venturini(1 and 2),Oliver Hennhöfer(2),Steffen Kinkel(2),Jannik Strötgen(2) ((1) Karlsruhe Institute of Technology, (2) Karlsruhe University of Applied Sciences)
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Karlsruhe University of Applied Sciences (卡尔斯鲁厄应用科学大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language transcripts. Developed under conditions of extreme data scarcity using children’s narrative data, the system shows the feasibility of linguistic analysis in low-resource settings. Our evaluation demonstrates that TalkTag produces encouragingly precise annotation while effectively identifying instances where linguistic ambiguity makes automated tagging genuinely complex. In summary, with TalkTag, we provide a scalable alternative to manual error annotation and practically viable support for morphosyntactic error annotation.
[NLP-75] CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation
【速读】: 该论文旨在解决大语言模型(LLM)代理在真实服务场景中评估时面临的挑战,即如何有效模拟复杂的任务依赖关系、不完美的用户行为以及允许多种合理解决方案的评估机制。其核心解决方案在于提出CRAB-Bench(基于约束的真实代理基准测试)与RUSE(真实用户仿真引擎):CRAB-Bench通过多实体间依赖的约束图生成包含结构化干扰项的任务,使代理需在成千上万条误导性候选方案中识别极少数有效解,从而考验其精细推理能力;RUSE则基于人类行为学研究,构建具有多样化人格特征和四个行为维度的真实用户模型,替代传统模板化、合作型仿真器。实验表明,即使是最先进的LLM代理在CRAB-Bench上的pass@1指标也仅达61%,而切换至RUSE后性能进一步下降高达57%,且损失主要体现在任务求解能力而非对话质量。其中,“信息泄露”是破坏性最强的行为维度,且与RUSE交互的代理更倾向于通过隐式修正掩盖错误,而非主动承认失误。
链接: https://arxiv.org/abs/2606.01815
作者: Danqing Wang,Akshay Sivaraman,Lei Li
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address this gap. CRAB-Bench generates tasks via a constraint graph over multiple interdependent entities with structured distractors, requiring agents to reason carefully over thousands of misleading candidates where only a tiny fraction of solutions are valid. RUSE replaces cooperative, template-like simulators with realistic users grounded in human behavioral studies, instantiated across diverse personas and four behavioral dimensions. Experiments on four frontier LLM agents show that the best model achieves only 61% pass@1 on CRAB-Bench, and switching to RUSE causes further drops of up to 57%, concentrated in task-solving ability rather than conversational quality. Information Disclosure is the most damaging behavioral dimension, and agents interacting with RUSE are less likely to admit mistakes, instead masking errors through implicit corrections.
[NLP-76] Cost-Aware Diffusion Draft Trees for Speculative Decoding
【速读】: 该论文旨在解决生成式 AI(Generative AI)推理过程中因草稿生成与验证阶段的资源分配不合理而导致的吞吐量瓶颈问题。现有方法如DDTree虽通过构建候选树以最大化预期接受长度,但其预算选择缺乏成本意识:接受长度随预算增加而单调递增,导致在未考虑验证延迟的情况下盲目扩大树规模,无法实现真正的性能优化。其解决方案的关键在于提出一种成本感知的扩散草稿树(Cost-aware Diffusion Draft Tree, CaDDTree),该方法直接以单位时间内生成的令牌数(token throughput)为优化目标,联合优化树结构与节点预算。通过显式建模草稿生成与验证延迟,并证明在验证成本凸性假设下,吞吐量函数具有单峰性(unimodal),从而可采用高效的贪心停止规则进行在线预算自适应调整。该方法无需离线预算搜索,能根据每轮位置分布与实时验证开销动态调节预算,在Qwen3-4B与Qwen3-8B模型上覆盖推理、编程与指令遵循等八项基准任务的实验表明,CaDDTree在绝大多数任务中达到或超越使用理想预算的DDTree性能,显著提升了推理效率与实用性。
链接: https://arxiv.org/abs/2606.01813
作者: Shuai Zhang,Huachuan Qiu,Hongliang He,Yong Dai
机构: Zhejiang University(浙江大学); Westlake University(西湖大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tree that maximizes expected acceptance length under a fixed node budget. We observe, however, that acceptance length is non-decreasing in budget: it always favors larger trees regardless of verification cost, offering no principled basis for budget selection. We introduce \textbfCaDDTree (Cost-aware Diffusion Draft Tree), a method that directly optimizes token throughput (expected tokens generated per unit time) by jointly selecting the tree structure and node budget. We model draft and verification latencies explicitly, show that the throughput objective decomposes into a per-round one-dimensional search over the budget, and prove that under a convex verification cost the throughput function is \emphunimodal, enabling an efficient greedy stopping rule. CaDDTree requires no offline budget search, adapting the budget each round from the current per-position distributions and verification cost. Experiments on Qwen3-4B and Qwen3-8B across eight benchmarks spanning reasoning, coding, and instruction-following tasks show that \caDDTree matches or surpasses DDTree with oracle budget selection on nearly all tasks.
[NLP-77] “Ive Seen How This Goes”: Characterizing Diversity via Progressive Conditional Surprise ICML2026
【速读】: 该论文旨在解决生成式内容多样性评估的难题,特别是在后训练阶段模式崩溃(post-training mode collapse)检测、解码策略比较以及人工智能与人类写作中创造性行为量化等场景下的核心挑战。传统方法通常依赖于预训练嵌入模型、参考语料库或人工标注,存在成本高、泛化性差等问题。本文提出一种基于上下文学习(in-context learning)的新颖多样性度量方法——Decan指标(DCan=C×an),其关键在于:通过在单次前向传播中读取基础模型θ对每种输入排列的逐标记对数概率,直接计算每个字节级别的多样性得分,无需额外训练专用模型、不依赖嵌入模型、参考语料库或人工标签。该方法建立在信息论基础上,利用语言模型的上下文学习能力捕捉任意数量输入之间的广泛相似性,实现了高效、无监督且可扩展的多样性评估。实验表明,该指标在Tevet和Berant的人类基准McDiv数据集上取得0.846的OCA分数,接近最强神经基线SentBERT(0.897);在OLMo-2-7B模型从基础模型到SFT、DPO再到RLVR的微调流程中,DCan呈现单调下降趋势,有效识别出创造性写作应用所关注的多样性损失类型,验证了其对实际生成质量变化的敏感性。
链接: https://arxiv.org/abs/2606.01811
作者: Matthew Khoriaty,David Williams-King,Shi Feng
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: this https URL
Abstract:Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the ``Decan’’ metric, D_Ca_n = C \times a_n , is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model \theta in a \emphsingle forward pass per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant’s human-grounded McDiv benchmark, D_Ca_n reaches OCA 0.846 on the McDiv prompt_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, D_Ca_n drops monotonically across the base \to SFT \to DPO \to RLVR stages, detecting the type of diversity loss that creative-writing applications care about.
[NLP-78] ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference ACL
【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在资源受限环境下部署时面临的性能与参数量之间的权衡问题。尽管SLMs在能力与计算可行性之间取得平衡,但其在实际应用中仍可能因硬件限制而难以高效部署。现有方法多依赖启发式策略进行模型压缩,缺乏对模型内部表征与任务相关性的系统性分析。本文提出ProbScale框架,其核心在于融合神经网络缩放定律(neural scaling laws)与语言模型探针(language model probing)的洞察,通过构建任务特定的探针来数学量化每个模型层对下游任务能力的贡献度,从而识别出参数高效子网络。该方法将子网络选择建模为在参数预算约束下最大化加权任务探针性能的优化问题,实现性能与参数规模的最优权衡。实验结果表明,在RoBERTa-Large和T5-Base等代表性SLMs上,ProbScale可实现5至10倍的参数缩减,同时保持原始模型95%至98%的性能表现,显著优于传统启发式基线方法。
链接: https://arxiv.org/abs/2606.01806
作者: Sourav Das
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, ACL
Abstract:Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model’s internals. We propose ProbScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.
[NLP-79] Multilinguality of Large Language Models From a Structural Perspective
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多语言处理中对非英语文本的结构化表征理解不足的问题。尽管现有研究通过分析词元(token)层面的表示揭示了模型对非英语文本的处理机制,但这些方法未能捕捉语言固有的结构性特征。为此,本文提出基于表征结构分析(representational structural analysis)的新方法,其关键在于从语言的结构性差异出发,系统考察不同资源水平语言在模型内部表征中的结构分化程度。研究发现,低资源语言在结构上相较于英语更具差异性,而经过语言特定的后训练(language-specific post-training)虽会改变其内部结构,却能有效维持跨语言之间的关系一致性,从而为多语言模型的可解释性与跨语言泛化能力提供了新的理论视角。
链接: https://arxiv.org/abs/2606.01800
作者: Haruki Sakajo,Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
机构: Nara Institute of Science and Technology (NAIST)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.
[NLP-80] HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems
【速读】: 该论文旨在解决大语言模型(LLM)智能体在异构任务场景中因执行范式差异而导致的系统适应性不足问题,尤其针对现有方法仅关注外部调用框架(harness)或内部推理策略(policy)的孤立优化,缺乏对整体系统层面的元适应(meta-adaptation)机制。其核心挑战在于:结构与执行之间的适应空间未被显式定义,且外部调用框架与内部推理策略之间缺乏协同兼容性优化。为此,论文提出HarnessForge——一种面向LLM智能体系统的元自适应框架,其关键创新在于将智能体系统形式化为“调用框架-策略”(harness–policy)对,通过显式分离执行结构(harness-level)与推理行为(policy-level),构建稳定可扩展的适应空间;进而采用故障引导的框架定制(fault-guided harness tailoring)与框架条件下的策略对齐(harness-conditioned policy alignment)实现框架与策略的协同演化。实验在五个跨领域基准上验证了该方法的有效性,表明该框架在提升Qwen3-4B和Qwen3-8B模型性能方面显著优于仅优化框架或仅优化策略的基线,最高性能提升达12.0%,同时展现出良好的推理效率权衡,证明了框架与策略间可执行兼容性对智能体系统适应性的决定性作用。
链接: https://arxiv.org/abs/2606.01779
作者: Mingju Chen,Can Lv,Guibin Zhang,Heng Chang,Shiji Zhou
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注: 25 pages, 13 figures
Abstract:LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted external harness or trained underlying reasoning policies, full-system adaptation remains insufficiently characterized. The adaptation space between structure and execution is rarely made explicit, and the compatibility between the external harness and the internal reasoner is not optimized jointly. We propose HarnessForge, a meta-adaptive framework for evolving LLM agent systems. HarnessForge formulates an agent system as a harness–policy pair, defining a stable adaptation space that separates harness-level execution structure from policy-level reasoning behavior. It then performs harness–policy co-evolution through fault-guided harness tailoring and harness-conditioned policy alignment. Experiments across five benchmarks from diverse domains show that HarnessForge consistently improves both Qwen3-4B and Qwen3-8B backbones, outperforming harness-only and policy-only baselines with gains of up to 12.0% over the strongest baseline and achieving favorable rollout-efficiency tradeoffs, demonstrating that harness–policy co-evolution is effective, and that executable compatibility between the harness and reasoning policy is essential for agent-system adaptation. The code is available at this https URL.
[NLP-81] An Algebraic View of the Expressivity of Recurrent Language Models ICML2026
【速读】: 该论文旨在解决循环神经语言模型(Recurrent Neural Language Models)在形式语言识别能力上的理论矛盾问题:现有文献中存在冲突结论——部分研究声称其具备图灵完备性,而另一些研究则指出其表达能力等价于正则语言。造成这一分歧的根本原因在于不同研究采用的算术模型(arithmetic model)存在差异。为此,论文提出一种统一的代数框架来系统分析循环神经网络的表达能力,从形式化角度刻画多种算术模型,并将表达能力问题转化为代数可判定性问题,例如判断网络的句法幺半群(syntactic monoid)是否整除某一特定的幂积(wreath product)。作为案例研究,论文重新审视了对角状态空间模型(diagonal state-space models),发现当强制使用浮点数递归时,同一架构无法实现偶模计数器(even-modulus counter),但在无符号整数量化条件下却可实现任意偶模计数器,从而揭示了算术模型对网络表达能力的关键影响。
链接: https://arxiv.org/abs/2606.01765
作者: Franz Nowak,Ryan Cotterell,Reda Boumasmoud
机构: 未知
类目: Formal Languages and Automata Theory (cs.FL); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 28 pages, 2 figures, to be published at ICML 2026
Abstract:What formal languages can a recurrent neural language model recognize? Formal results in the literature conflict: some authors report Turing-completeness, while others show equivalence to regular languages. The reason for this discrepancy is that the underlying arithmetic model differs. The paper develops a unified algebraic account of the expressivity of recurrent neural networks, starting with a formal account of various arithmetic models. This account reduces expressivity to an algebraic question, e.g., whether a network’s syntactic monoid divides a certain wreath product. As a case study, the paper revisits diagonal state-space models: the same architecture cannot implement an even-modulus counter once floating-point recurrences are enforced, yet realizes every even-modulus counter under unsigned-integer quantization.
[NLP-82] riAlign: Towards Universal Truth Consistency in Personalized LLM Alignment
【速读】: 该论文旨在解决个性化大语言模型(Personalized Large Language Models, PLLMs)在适应用户偏好与社会属性时,导致不同社会群体间普遍真理一致性显著下降的问题。具体而言,现有对齐方法或忽略个性化,或仅关注主观偏好对齐,忽视了客观任务中跨群体的公平性与真理一致性。为填补这一空白,论文提出“真理不变对齐”(Truth-Invariant Alignment, TIA)这一新范式,其核心目标是在保持个性化的同时,确保普遍真理在不同社会群体间保持一致。解决方案的关键在于提出TriAlign——首个面向TIA的离线多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)框架,将每个社会群体建模为独立智能体进行交互;通过引入兼顾公平性的目标函数和显式的不一致性惩罚项,联合优化普遍真理准确性、跨群体真理一致性与个性化水平。实验结果表明,TriAlign在多个基准测试中显著优于现有强基线,有效缓解了群体间的普遍真理差异,同时提升了客观任务性能与个性化质量。
链接: https://arxiv.org/abs/2606.01755
作者: Thi-Nhung Nguyen,Linhao Luo,Rollin Omari,Junae Kim,Thuy-Trang Vu,Dinh Phung
机构: Monash University (莫纳什大学); Defence Science and Technology Group (澳大利亚国防科学技术集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Personalized large language models adapt responses to users’ preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either ignore personalization or mainly focus on subjective preference alignment, largely overlooking fairness and consistency in universal truths. To address this gap, we study Truth-Invariant Alignment (TIA), an alignment problem for personalized LLMs that aims to ensure universal truths remain consistent across social groups while preserving personalization. We propose TriAlign, the first offline multi-agent reinforcement learning (MARL) framework for TIA, where each social group is modeled as an agent interacting. TriAlign jointly optimizes universal truth accuracy, cross-group truth consistency, and personalization through a fairness-aware objective and an explicit inconsistency penalty. Experiments across diverse benchmarks demonstrate that TriAlign achieves a stronger balance among these three objectives than strong baselines, reducing universal truth disparities across social groups while improving both objective task performance and personalization quality.
[NLP-83] Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks
【速读】: 该论文旨在解决传统历史文本中存在语言歧义、上下文依赖性强的指代问题以及缺乏统一语法规则所带来的知识抽取难题。其核心解决方案是构建一种融合双向编码器表示的Transformer(BERT)与图神经网络(GNN)的联合架构,通过上下文敏感的语义表征与关系图学习相结合的方式,实现对多源异构历史文本中实体与关系的高精度自动抽取。该方法能够有效处理复杂嵌套结构和隐含指代等挑战性问题,在市政档案、议会文件及历史书信等大规模史料数据上验证了其优越性,显著优于传统的规则驱动方法与其他主流深度学习基线模型,从而实现了历史知识的系统化结构化转化,为知识库积累提供智能化支持。
链接: https://arxiv.org/abs/2606.01747
作者: Ping Li,Bartlomiej Brzozka
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Through digital humanities research and scale-up historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines bidirectional encoder representations of transformers (BERT) and graph neural networks (GNN) to extract the entities and relationships from various types of historical texts. The texts of traditional history resolve linguistic ambiguities, references limited by context, and a lack of established grammatical norms in a systematic way. This study develops a new image retrieval system based on FastRQNet and pre-trained vision-language model Vilt-qaformer+RoBInet in accordance with the aforementioned recommendations. The experiments make full use of a comprehensive collection of municipal records, parliamentary documents, and historical correspondence. When compared to conventional rule-based techniques and other popular deep-learning baselines, the joint BERT-GNN system obtains greater Precision, Recall, and F1-score (Table 2). Complex nested structures and implicit reference issues can be handled by this structure with sufficient accuracy and thoroughness when creating knowledge graphs. The aforementioned experiments show that combining relational graph learning algorithms with context-sensitive semantic representation techniques can automatically extract historical data to add accumulated wisdom to the knowledge repository.
[NLP-84] HRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models
【速读】: 该论文旨在解决多轮越狱攻击(multi-turn jailbreak attacks)对大语言模型(LLM)构成的安全威胁,此类攻击利用对话过程中的渐进式升级与跨轮次协同机制,现有防御方法或依赖高成本的重新训练(常导致模型能力下降),或在每一轮独立进行单轮分析,无法捕捉风险在交互轨迹上的累积效应。其核心解决方案在于提出首个无需训练的防御框架THRD,关键创新在于显式建模时间维度上的风险累积过程。THRD通过四个模块协同实现:逐轮风险评估器(TRA)用于即时风险估计,历史上下文分析器(HCA)检测跨轮次意图升级,响应评估器(RE)识别助长性输出,并由决策模块结合上述信号,采用随时间演化的评分机制,引入衰减调制与趋势感知调整策略,动态整合多轮信息。实验表明,THRD在对抗先进多轮攻击(包括基于树搜索和多智能体协作的方法)时,将攻击成功率(ASR)降至0.2%–4.0%,同时在MMLU和GSM8K上仅造成不超过1.5%的性能退化,且消融实验证明各模块贡献非冗余、具备跨架构稳定性。对首次拒绝触发点的分析显示,超过70%的多轮攻击需至第2轮或更晚才能被识别,验证了显式时间聚合机制的必要性。
链接: https://arxiv.org/abs/2606.01738
作者: Zhiqing Ma,Zhonghao Xu,Dong Yu,Chen Kang,Changliang Li,Pengyuan Liu
机构: Beijing Language and Culture University (北京语言大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining – often degrading model utility – or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model’s conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks – including tree-search-based and multi-agent collaborative methods – across two target models show that THRD reduces ASR to 0.2–4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn~2 or later to detect, validating the necessity of explicit temporal aggregation.
[NLP-85] Argument Collapse: LLM s Flatten Long-Form Public Debate
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成公共论辩文本时导致的“论证坍缩”(argument collapse)问题,即不同模型生成的论述趋于收敛至少数几个主流论点、次级论据及固定段落结构,从而削弱公共讨论的多样性与深度。其解决方案的关键在于揭示并量化这种坍缩现象:通过对比来自《纽约时报》(NYT)和《波士顿评论》(BR)的数千条人类论辩文本与数万条由LLMs生成的论文,研究发现人类论点中约65.3%为特定辩论中的独有主论点,而LLM生成的主论点仅有3.4%具有唯一性;在次级论据层面,人类有41.0%的子论点为独特,而LLM仅9.1%;且LLM倾向于重复使用泛化、模糊化的子论据,并采用高度模式化的论述结构(如直接陈述主张后迅速转向对策),缺乏人类论辩中常见的具体性与多样性。尽管要求模型生成多样化回答可部分缓解此问题,但其新增变异大多超出人类实际论辩空间,表明当前生成式AI在维持公共话语多样性方面仍存在根本性局限。
链接: https://arxiv.org/abs/2606.01736
作者: Yekyung Kim,Yapei Chang,Chau Minh Pham,Mohit Iyyer
机构: University of Maryland, College Park (马里兰大学学院公园分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, with much of the added variation falling outside the observed human argument space. Collapse also appears in sub-arguments, where among essays with the same main argument, 41.0% of human sub-arguments are unique versus 9.1% from LLM responses. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structure-wise, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals. The same patterns hold in longer BR essays, suggesting that argument collapse extends beyond short-form responses.
[NLP-86] RCEM: Embedder Equipped with Query Rewriting Skill for Robust Conversational Search in Distributional Shift
【速读】: 该论文旨在解决检索增强生成(RAG)系统中多轮对话搜索(conversational search)面临的挑战,即如何在不依赖显式查询重写的情况下实现上下文感知的精准检索。现有方法通常直接学习对话到文档的匹配关系,但在分布外(distributional shift)场景下表现脆弱,且依赖高质量的对话查询-文档相关性标注,这类数据获取成本高、难度大。其解决方案的关键在于提出RCEM模型,通过将大语言模型(LLM)的查询重写能力蒸馏至嵌入模型中,使对话查询嵌入与重写后查询嵌入对齐,从而在推理阶段无需显式重写即可实现上下文敏感的检索。该方法避免了对复杂相关性标注的依赖,同时保持了原始嵌入模型的独立检索功能,支持单一模型同时处理独立查询与对话查询,并兼容现有文档索引,无需重建检索数据库。实验结果表明,RCEM在QReCC、TopiOCQA和TREC CAsT等多个基准上均显著优于主流基线,尤其在分布外场景下表现出更强鲁棒性,Recall@10提升最高达20%。
链接: https://arxiv.org/abs/2606.01697
作者: Kilho Son,Paul Hsu,Cha Zhang,Dinei Florencio
机构: Microsoft(微软)
类目: Computation and Language (cs.CL)
备注:
Abstract:Conversational search has become increasingly important in retrieval-augmented generation (RAG) systems, where users interact with AI assistants through multi-turn conversations containing context-dependent queries. We propose RCEM, a conversational dense retrieval model that distills the query reformulation capability of LLMs into the embedding model, enabling context-aware retrieval without explicit query rewriting during inference. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-document matching, RCEM aligns conversational-query embeddings with rewritten-query embeddings, improving robustness under distributional shift. RCEM does not require conversational query-to-document relevance mappings for training, which are often expensive and difficult to obtain with high quality. Extensive experiments on QReCC, TopiOCQA, and TREC CAsT demonstrate that RCEM consistently outperforms strong conversational retrieval baselines, achieving particularly large gains under distributional shift, including up to 20% improvement in Recall@10. RCEM further extends the base embedding model with conversational query rewriting capability while preserving its original retrieval functionality, allowing both standalone and conversational queries to be encoded by a single model and searched against existing document indexes without rebuilding the retrieval database.
[NLP-87] Off-the-Shelf LLM s as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
【速读】: 该论文旨在解决小模型在生成过程中因提前锁定错误推理路径而导致性能下降的问题,尤其是在使用强评分器(strong scorer)进行后验选择时无法纠正已有错误的局限性。现有方法如基于奖励模型的路径选择(PRM guided search)虽可实现生成过程中的动态修正,但依赖于需大量人工标注的逐步奖励标签,训练成本高且难以泛化。为此,本文提出一种无需训练的**分块级引导生成(Chunk-Level Guided Generation)**框架,其核心在于利用现成的大语言模型(LLM)作为过程评分器,在每一步生成中对固定长度的候选片段(chunk)进行评分,而非对变长推理步骤进行评估。关键创新点在于:通过引入固定长度的候选块设计,有效规避了大模型在评估变长文本时存在的系统性长度偏差问题(length bias),并显著提升评分可靠性。具体实现上,提出了两种选择策略——似然引导选择(LGS)与对比引导选择(CGS),其中后者通过减去小模型自身的似然得分,突出大模型与小模型偏好差异的片段,从而更有效地引导生成方向。实验表明,该方法在GSM8K、MATH、Minerva Math等数学推理基准上表现优异,尤其在相同引导预算下超越多数投票法达28个百分点,并接近甚至超过需训练奖励模型的PRM方法,同时生成的推理轨迹显著更短,具备更高的效率与实用性。
链接: https://arxiv.org/abs/2606.01682
作者: Atoosa Chegini,Soheil Feizi
机构: University of Maryland (马里兰大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model’s log-probability to favor chunks where the large model’s preference diverges from the small model’s. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4–6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.01682 [cs.CL] (or arXiv:2606.01682v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.01682 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Atoosa Malemir Chegini [view email] [v1] Mon, 1 Jun 2026 04:43:36 UTC (61 KB)
[NLP-88] Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification
【速读】: 该论文旨在解决多模态大语言模型(Multimodal LLMs)在科学同行评审中判断论文论断是否得到证据支持时,对图表(chart)证据的处理效果显著低于表格(table)证据的问题。其核心问题是:模型是否无法从图表中提取信息,还是能够提取但未能有效利用这些信息进行推理?研究通过层间线性探测(layer-wise linear probing)与注意力机制分析,在三款开源视觉-语言模型(VLMs)上对同一组数据的表格与图表证据进行对比,发现模型确实能够将图表信息编码至中间表示层,但这些信息未能有效传递至最终预测位置,而这一“信息路由断裂”现象在表格任务中并不存在。进一步的注意力分析揭示,这种断连在不同模型家族中呈现出两种架构上迥异的形式。因此,该研究的关键发现是:表-图性能差距的本质并非源于视觉信息编码失败,而是源于预测阶段信息传递路径的失效,即“编码成功但路由失败”。
链接: https://arxiv.org/abs/2606.01679
作者: Sunisth Kumar,Xanh Ho,Tim Schopf,Andre Greiner-Petter,Florian Boudin,Akiko Aizawa
机构: The University of Tokyo (东京大学); NII LLMC (国立情報学研究所语言模型与计算中心); National Institute of Informatics (日本国立情报学研究所); University of Göttingen (哥廷根大学); Inria, LS2N, Nantes Université (法国国家信息与应用数学研究院,南特大学计算机科学与网络实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of the same underlying data. This raises the question of whether models fail to extract information from charts, or do they extract it but fail to use it when forming their prediction? We study this question through layer-wise linear probing and attention analysis on three open-weight VLMs over table and chart evidence, representing the same underlying data. We find consistent evidence for the latter. Chart information is encoded in the models’ intermediate representations but does not reach the prediction position, a gap that is absent for tables and holds across all conditions tested. Attention analysis further reveals that this disconnect takes two architecturally distinct forms across model families. These findings reframe the table-chart gap as a failure of how encoded visual information is routed at prediction time, rather than a failure of encoding itself.
[NLP-89] Why Do Self-Harm Prediction Models Struggle to Generalise? Lexical and Semantic Variations in Emergency Department Triage Notes
【速读】: 该论文旨在解决生成式AI在跨机构临床文本中识别自伤行为时性能下降的问题。其核心挑战在于,尽管单个医院的自然语言处理(Natural Language Processing, NLP)模型在检测急诊科(Emergency Department, ED)分诊记录中的自伤行为方面表现良好,但在不同医疗机构间迁移时性能显著降低。解决方案的关键在于揭示并分析不同医院在自伤相关文本表达上的词汇特征、预测性特征重要性及显著主题的差异。研究发现,虽然核心主题如自中毒和自伤保持一致,但各机构在语言表达方式和关键预测特征上存在明显变异,这种文档层面的机构特异性差异是导致模型泛化能力不足的主要原因。因此,提升模型跨机构适用性的关键在于充分考虑并建模这些机构间的语义与表达差异,例如通过引入领域自适应机制或构建更具鲁棒性的特征表示。
链接: https://arxiv.org/abs/2606.01678
作者: Liuliu Chen,Mike Conway,Jo Robinson,Vlada Rozova
机构: The University of Melbourne(墨尔本大学); Orygen, The National Centre of Excellence in Youth Mental Health(青年心理健康国家卓越中心); Centre for Youth Mental Health(青年心理健康中心); Centre for Digital Transformation of Health(健康数字化转型中心)
类目: Computation and Language (cs.CL)
备注: Accepted to CLPsych2026
Abstract:Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown robust performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. Our findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.
[NLP-90] When Meaning Travels: A Granular Lens on Hybrid-MoEs Role in Idiomatic Understanding for Language Models
【速读】: 该论文旨在解决低资源东南亚语言(如印地语、孟加拉语和泰语)中习语的隐喻性与文化语义在计算建模及跨语言迁移中的难题,尤其针对其深层隐喻复杂性导致的表意丢失问题。其核心解决方案是构建一个包含3,533个多语言习语的重构多模态习语语料库——Varnika,该语料库融合了文本与视觉表征,并标注了七种习语语调(idiomatic tones)。同时,提出一种混合专家模型(Hybrid Mixture-of-Experts, HybridMoE),通过整合被选中与未被选中专家的输出,实现受控的混合化以缓解专家稀疏性问题,并引入习语属性信号(Idiomatic Property Signals)增强掩码多模态嵌入的表征能力。为全面评估模型性能,设计了IDIO-TONE与习语验证得分(Idiomatic Validation Score)的三阶段评估体系,分别衡量字面翻译保真度、视觉-语义对齐程度以及习语意义保留率。实验表明,HybridMoE在先进视觉-语言模型上实现了5–6%的性能提升,显著增强了多语言多模态场景下对隐喻语言与文化嵌入意义的表示能力。
链接: https://arxiv.org/abs/2606.01671
作者: Sarmistha Das,Vaibhav Vishal,Shreyas Guha,Amaan Ali,Kitsuchart Pasupa,Sriparna Saha
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:In the contemporary epoch of multilingual education, learning idioms provides a fascinating gateway towards creativity, cultural values, historical context, and diverse perspectives inherent to various linguistic traditions. This paper showcases the navigation of retaining figurative and cultural semantics in low-resource Southeast Asian languages such as Hindi, Bengali, and Thai, where culturally rich idioms pose significant obstacles for computational modeling and cross-linguistic transfer due to their deep metaphorical complexity. To tackle such complexity, we present Varnika, a reconstructed multimodal idiom corpus comprising 3,533 multilingual idioms, enriched with seven idiomatic tones aligned with both textual and visual representations. Additionally, to infer informative idiomatic understanding, we introduce a Hybrid Mixture-of-Experts (HybridMoE) framework that embeds multiple idiomatic expert opinions while mitigating expert sparsity by integrating outputs from both selected and unselected experts through controlled hybridization, further augmented with Idiomatic Property Signals via masked multimodal embeddings. To analyze the performance across multiple dimensions, we propose the IDIO-TONE and Idiomatic Validation Score, a three-stage evaluation pipeline measuring (i) literal translation fidelity, (ii) visual-semantic alignment, and (iii) idiomatic meaning retention. Empirical evaluations highlight that HybridMoE achieves 5–6% performance gains across advanced vision language models, demonstrating improved representation of figurative language and culturally embedded meaning in multilingual multimodal settings
[NLP-91] MobEvolve: An Agent ic Self-Evolving Heuristic System for Interpretable Human Mobility Generation
【速读】: 该论文旨在解决人类出行生成(human mobility generation)中长期存在的多维度挑战,即在保证个体轨迹保真度、群体层面分布一致性、行为合理性以及推理高效性的同时,维持模型的可解释性。现有方法如深度生成模型、基于大语言模型(LLM)的方法及传统启发式规则,难以兼顾上述多重需求。其解决方案的关键在于提出首个代理自演化启发式框架——MobEvolve:该框架以行为驱动的启发式系统为初始结构,利用一个LLM代理通过迭代方式不断演化内部逻辑;通过诊断验证集上的实际偏差与失败案例,代理能够提出针对性优化策略,并积累演化记忆实现持续自我改进。实验结果表明,MobEvolve在新加坡与蒙特利尔两个基准数据集上显著优于当前最先进的深度生成与基于LLM的方法,在保持高推理效率和可解释性的前提下,全面提升了个体轨迹保真度、群体分布对齐性与行为合理性。
链接: https://arxiv.org/abs/2606.01640
作者: Junlin He,Yihong Tang,Tong Nie,Ao Qu,Yuebing Liang,Hamzeh Alizadeh,Bang Liu,Wei Ma,Lijun Sun
机构: The Hong Kong Polytechnic University(香港理工大学); McGill University(麦吉尔大学); MIT(麻省理工学院); Tsinghua University(清华大学); Autorité régionale de transport métropolitain(大都会交通区域管理局); Université de Montréal(蒙特利尔大学); Mila – Quebec AI Institute(魁北克人工智能研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maintaining interpretability, behavioral plausibility, population-level distributional alignment, and inference efficiency. To bridge this gap, we introduce MobEvolve, the first agentic self-evolving heuristic framework for human mobility generation. MobEvolve initializes a behavior-inspired heuristic system and employs an LLM agent to iteratively evolve its internal logic. By diagnosing empirical misalignments and failure cases on a validation set, the agent proposes targeted updates and accumulates evolution memory for cumulative self-improvement. Extensive evaluations on the Singapore and Montreal benchmarks demonstrate that MobEvolve significantly outperforms state-of-the-art deep generative and LLM-based methods in individual trajectory fidelity, population-level distribution alignment, and behavioral plausibility, while preserving interpretability and high inference efficiency.
[NLP-92] Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多智能体系统中因社会性影响导致的“从众偏差”(conformity)问题,即模型可能因其他智能体的一致回答而放弃自身原本正确的判断,从而引入新错误。其核心挑战在于:当模型基于同伴反馈进行修正时,这种修正究竟是有助于纠正错误,还是反而加剧了错误传播。研究的关键发现是,在控制实验中,同伴共识结构显著增强了对初始正确答案的误导效应,远高于对初始错误答案的纠正能力;同时,赋予同伴权威标签会进一步提升模型采纳其所推荐答案的概率,无论该答案是否正确。更令人担忧的是,常见的生成式推理干预手段(如思维链Chain-of-Thought和反思Reflection)无法可靠地抑制有害修订行为,同时保持有益修订的有效性。因此,该研究提出的核心解决方案在于:多智能体大模型系统不应简单地聚合或信任同伴输出,而应建立对同伴回答的验证机制,以保障决策的可靠性与鲁棒性。
链接: https://arxiv.org/abs/2606.01637
作者: Jiaming Qu,Lucheng fu,Yibo Hu
机构: Amazon(亚马逊); Georgia Institute of Technology(佐治亚理工学院); Illinois Institute of Technology(伊利诺伊理工大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are increasingly used in multi-agent systems, where they see and respond to other agents’ answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
[NLP-93] AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)后训练过程中响应令牌(response token)选择缺乏系统性评估框架的问题。现有方法多依赖局部启发式规则,未能将令牌选择建模为对个体响应令牌的合理价值评估。其核心解决方案是提出AlphaToken,一个将令牌价值评估解耦为适应性(adaptation,促进目标任务学习)与稳定性(stability,保持预训练能力)两个目标的框架,并通过结合局部梯度的直接路径信号与自回归生成中的下游因果路径信号,实现路径感知(path-aware)的价值评估。由于保留数据通常不可用,AlphaToken采用以预训练参考模型为锚点的Fisher-漂移代理(Fisher-drift proxy)来近似稳定性。为提升计算效率,该方法进一步将Ghost Dot-Product扩展至令牌级价值评估。实验表明,AlphaToken在微调和偏好优化中屏蔽低价值令牌,使训练信号聚焦于高价值位置,显著提升了后训练性能并有效缓解灾难性遗忘。
链接: https://arxiv.org/abs/2606.01635
作者: Liu Qing,Ou Wu,Yi Du
机构: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences (杭州高等研究院,中国科学院大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce \textbfAlphaToken , a response token valuation framework that decouples valuation into \textbfadaptation (promoting target-task learning) and \textbfstability (preserving pre-trained capabilities), and makes each objective \textbfpath-aware by combining the direct-path signal from local token gradients with the downstream causal-path signal in autoregressive generation. Since retention data are typically unavailable, AlphaToken approximates stability via a \textbfFisher-drift proxy anchored at the pre-trained reference model. For efficient computation, we extend Ghost Dot-Product to token-level valuation. AlphaToken masks low-value response tokens during fine-tuning and preference optimization, concentrating training signals on more valuable positions. Experiments show that AlphaToken improves post-training performance and mitigates catastrophic forgetting.
[NLP-94] Benchmarking LLM -as-a-Judge for Long-Form Output Evaluation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本生成输出评估中面临的可靠性挑战。随着LLM在长篇内容生成中的广泛应用,如何有效评估其生成质量成为关键问题,而现有基于“大语言模型作为评判者”(LLM-as-a-judge)的评估方法主要依赖于短文本评测基准,难以覆盖长文本生成所特有的复杂文档级需求。为应对这一问题,论文提出LongJudgeBench,一个涵盖多种真实场景与评判协议的综合性长文本评估基准,系统评估了多种基础模型及不同评判设置下的LLM评判者表现。研究发现,当前的LLM评判者在跨场景评估中仍存在显著的不稳定性,尽管使用评分标准或参考答案可提升评估质量,但其作用有限且非普适。因此,该研究的关键突破在于构建了一个能够揭示评估系统内在脆弱性的基准工具,并强调未来需发展更鲁棒、具备上下文感知能力且与人类判断对齐的LLM-as-a-judge方法。
链接: https://arxiv.org/abs/2606.01629
作者: Junjie Chen,Yuxi Dong,Haitao Li,Weihang Su,Yujia Zhou,Min Zhang,Yiqun Liu,Qinyao Ai
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at this https URL.
[NLP-95] EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在专业化、高风险领域中因标注成本高昂而表现逊于小型监督模型的问题。针对这一挑战,其核心解决方案是提出EvoPool——一种受达尔文进化论启发的演化多智能体框架。其关键在于:通过三个专业智能体迭代生成可执行的标注器代码,利用小规模验证集提供适应度信号,并通过确定性门控机制筛选出满足生存力、多样性及边际贡献要求的标注器;同时,EvoAgg作为文本感知聚合器,结合语义特征与标注器投票特征,将多源标注结果映射为软标签。该方法实现了近零每样本成本的标注池构建,在10万样本规模下比传统LLM标注快4500至31000倍。在8个复杂且专业化任务(涵盖生物医学关系抽取、法律条款分类、复杂推理及密集多标签生物医学分类)中,EvoPool在7个任务上超越最强的LLM标注基线,平均宏F1提升0.141,最高达ChemProt任务上的+0.301和PubMed任务上的+0.265。
链接: https://arxiv.org/abs/2606.01617
作者: Tianyi Xu,Yaolun Zhang,Xuan Ouyang,Huazheng Wang
机构: Oregon State University (俄勒冈州立大学); University of Wisconsin–Madison (威斯康星大学麦迪逊分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 39 pages, 7 figures. Code: this https URL
Abstract:Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: this https URL
[NLP-96] RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation
【速读】: 该论文旨在解决当前视频世界模型(Video World Models)在机器人操作任务中缺乏对可信度评估的问题,尤其针对现有基准测试仅在合理、可行且安全的指令下进行评估的局限性。为填补这一空白,研究提出RoboTrustBench,一个涵盖四种情境(正常、约束敏感、反事实、对抗性)的综合性基准,基于真实世界DROID数据集中的1,207个经专家验证的指令-图像对构建,并采用包含13项细粒度指标的六维评估协议。其解决方案的关键在于:通过引入更具挑战性的非理想场景(如违反物理规律或安全规则的指令),系统性检验模型在约束推理、反事实推理、物理交互建模及危险指令抑制等方面的能力。实验表明,尽管当前主流视频世界模型在视觉连贯性上表现良好,但在深层次语义理解与安全可控行为生成方面仍存在显著缺陷,揭示了仅依赖视觉质量与表面指令遵循无法保障模型的可信性,强调了构建具备因果推理与安全意识能力的视频世界模型的重要性。
链接: https://arxiv.org/abs/2606.01600
作者: Huiqiong Li,Jiayu Wang,Zhiting Mei,Anirudha Majumdar,Jingjing Chen,Bin Zhu
机构: Singapore Management University(新加坡管理大学); Fudan University(复旦大学); Princeton University(普林斯顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Robotics (cs.RO)
备注: Project: this https URL
Abstract:Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sensitive, Counterfactual, and Adversarial. Built from real-world DROID episodes, RoboTrustBench contains 1,207 expert-validated instruction-image pairs and a six-dimensional evaluation protocol with 13 fine-grained criteria. Evaluating seven representative video world models with human and MLLM assessment, we find that current models often generate visually coherent videos, but struggle with constraint reasoning, counterfactual grounding, physical interaction, and unsafe-instruction suppression. These results show that visual quality and surface-level instruction following are insufficient for trustworthy robotic video world modeling.
[NLP-97] Identifying High-Confidence Social Biases in LLM s for Trustworthy Conversational Tutoring Agents
【速读】: 该论文旨在解决生成式 AI(Generative AI)在对话式辅导系统中可能隐匿并放大社会偏见的问题,尤其关注大语言模型(LLMs)在教育场景下对刻板印象偏见识别能力不足且表现出过度自信的现象。其核心挑战在于:尽管现有基准评估显示模型具备一定偏见检测能力,但在真实、动态的师生交互情境中,模型常无法识别出明显存在的偏见,却仍以高置信度做出错误判断,从而影响其推理过程与对学生提供的反馈质量。该研究的关键解决方案是提出一种新的数据生成方法,通过重构学生-智能导师对话,并引入基于基准数据集控制性注入的偏见话轮,实现自然化教学条件下的偏见评估。结合计算与人工评估,研究揭示了当前先进大模型在对话辅导场景中对刻板偏见的识别能力显著下降,且其高置信度表现严重误导反馈逻辑,凸显了过自信、有偏见行为在教育AI应用中的潜在风险。
链接: https://arxiv.org/abs/2606.01584
作者: Aitor Arronte Alvarez,Naiyi Xie Fincham
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for AIED 2026
Abstract:Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particular risks in educational settings. In this study, we evaluate LLMs in conversational tutoring scenarios to identify high-confidence social biases, instances where models are unable to identify biased judgments in tutoring conversations while maintaining strong confidence in their assessments, potentially affecting their reasoning and the feedback they provide to learners. We present a new dataset generation method that enables bias evaluation under naturalistic instructional conditions by regenerating student-AI tutor interactions and introducing turns with controlled bias derived from a benchmark dataset. Using this data, we assess multiple LLMs’ ability to detect stereotypical biases and analyze the confidence and reasoning underlying their responses through computational and human evaluations. We find that bias detection is substantially more challenging in conversational tutoring contexts than in benchmark-based evaluations, and that state-of-the-art LLMs are overconfident in their incorrect assessments of stereotypical bias statements. Moreover, model confidence strongly influences reasoning and feedback, highlighting the risks of overconfident, biased behavior in LLM-based tutoring agents. We conclude by discussing implications, mitigation considerations, and directions for future research. Comments: Accepted for AIED 2026 Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.01584 [cs.CL] (or arXiv:2606.01584v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.01584 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-98] Defenses Enablers For Skill Injection Attacks on Terminal Based Agents DATE
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)代理在依赖可重用技能(reusable skills,即描述特定任务流程的文档)时所引入的新攻击面问题。随着技能文件成为代理执行任务的核心资源,恶意攻击者可通过篡改或诱导性表述操纵这些文件,从而实现非预期行为。为应对这一威胁,研究提出了两种互补的防御策略:一是静态守护机制(static guardian),在构建阶段预处理并重写技能文件;二是动态守护机制(dynamic guardian),通过一个中介型LLM代理实时干预对技能文件的访问。实验表明,这两种守护机制在三种不同架构的LLM代理中均能将攻击成功率(Attack Success Rate, ASR)降低超过50%,同时有效保持任务实用性。进一步地,研究通过四类攻击重构(attack reframing)测试了防御鲁棒性——此类攻击保留恶意指令语义但改变表述方式,以绕过基于规则或模式匹配的防御。在无守护机制情况下,重构攻击使ASR上升至81.4%,而动态守护机制将其降至18.6%,证明实时中介审查具备更强的抗攻击能力,是保障技能安全性的关键所在。
链接: https://arxiv.org/abs/2606.01567
作者: Yoshinari Fujinuma,Varun Gangal,Traian Rebedea,Makesh Narasimhan Sreedhar,Prasoon Varshney,Rebecca Qian,Anand Kannappan
机构: Patronus AI; NVIDIA
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: First version, small updates and clarifications likely in v2
Abstract:Large language model (LLM) agents increasingly rely on reusable skills i.e. documents describing task-specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions for this threat. First, we evaluate guardian-based defenses: an intermediary LLM agent that acts as a mediator for skill file access (dynamic guardian) or pre-rewrites these files at build time (static guardian). Across three LLM agent families, our guardians cut attack success rate (ASR) by well over half while preserving task utility. Second, we stress test them through attack reframing using four attacks that preserve the malicious instruction but change the phrasing. For non-guardian setup, the reframing pushes the ASR up to 81.4%, but the dynamic guardian brings it down to 18.6%, showing that real-time mediation is a robust defense.
[NLP-99] Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense
【速读】: 该论文旨在解决高风险企业文档生成场景中面临的多维度合规性与系统效率难题,具体包括:在金融纠纷叙述、合规通知及审计摘要等任务中,确保输出内容在结构模式(schema)、政策合规性(policy compliance)和低延迟(low-latency)方面的严格要求。传统系统依赖独立的敏感信息识别(PII redaction)、内容审核(content moderation)与格式校验(format validation)模块串联执行,导致逻辑碎片化、请求路径冗长且运维成本高昂。其核心解决方案是提出一种面向文本与图像输入的统一守卫层(guardrail orchestration layer),通过多候选生成(multi-candidate generation)与显式合规评分(compliance score)机制实现早期退出(early exit)。该框架支持可配置的并行生成头(generation heads),基于加权守卫规则(包括PII检测、内容审核、结构约束与领域规则)对多个候选输出进行评分,并返回得分最优的结果及其选择元数据。实测数据显示,系统可在20秒内完成5次尝试,整体合规率达91%。通过对支付争议辩护摘要的聚合运营读数分析(非随机化A/B测试),变量组在总体胜率上显著优于对照组(301/659 vs. 536/1548),提升11.0个百分点(95%置信区间[6.6, 15.5],p < 0.001);针对调整后的“未收到物品”案例,提升7.5个百分点(95%置信区间[0.2, 15.7],p = 0.045)。此外,系统还提供了由评审员校准的负责任人工智能(Responsible-AI)证据质量信号(基于770条生成证据评审与70例OCR样本分析),并明确界定请求接口、评分逻辑、伪代码及操作证据边界,保障系统的可复现性。
链接: https://arxiv.org/abs/2606.01513
作者: Nataraj Agaram Sundar,Tejas Morabia
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 8 pages, 7 figures, 4 tables. Preprint. Applied systems paper on compliance-scored guardrail orchestration for multimodal LLM document generation. Contains aggregate operational readouts; not a randomized A/B test
Abstract:High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema correctness, policy compliance, and low-latency operation at scale. Prior to a unified guardrail layer, production systems often stitched together separate PII redaction, content moderation, and format validation steps, leading to fragmented logic, slower request paths, and higher operational cost. We present a guardrail orchestration layer for text and image inputs that couples multi-candidate generation with an explicit compliance score used for early exit. The framework runs configurable parallel generation heads, scores candidates against weighted guardrails including PII detection, content moderation, schema constraints, and domain rules, and returns the best-scoring output with selection metadata. The available operational readout reports 5 attempts within 20 seconds and 91 percent compliance. For payments dispute defense summaries, we analyze aggregate operational scenario readouts rather than a randomized A/B test. Variable cohorts show higher count win rates than controls overall, 301/659 versus 536/1548, corresponding to +11.0 percentage points with 95 percent confidence interval [6.6, 15.5] and p 0.001, and for adjusted item-not-received cases, +7.5 percentage points with 95 percent confidence interval [0.2, 15.7] and p = 0.045. Fraud and local evidence-ranking deltas are directionally positive but not statistically significant from the aggregate count data. We also report reviewer-calibrated Responsible-AI evidence-quality signals from 770 generated-evidence reviews and a 70-case OCR slice, and document the reproducibility boundary through the request interface, scoring logic, pseudocode, and operational evidence boundary. Comments: 8 pages, 7 figures, 4 tables. Preprint. Applied systems paper on compliance-scored guardrail orchestration for multimodal LLM document generation. Contains aggregate operational readouts; not a randomized A/B test Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2606.01513 [cs.DC] (or arXiv:2606.01513v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2606.01513 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-100] On the Limits of Token Reduction for Efficient Unified Vision Language Training
【速读】: 该论文旨在解决统一视觉-语言模型(Unified Vision-Language Models, VLMs)在联合训练过程中计算开销巨大且效率优化研究不足的问题。其核心挑战在于如何在保持模型性能的同时,实现高效训练。解决方案的关键在于通过分层注意力分析揭示了视觉理解与视觉生成任务在深层特征依赖上的根本不对称性:视觉理解存在显著的后期层视觉冗余,而视觉生成则在整个网络深度中持续依赖图像标记(image tokens)。基于此发现,作者设计了针对不同任务特性的令牌缩减加速器,分别对两类任务进行选择性图像令牌计算削减。然而,实验表明,在统一训练场景下,这种任务特异性令牌丢弃策略会导致参数路径分化,破坏任务间的协同增益,从而引发系统性性能损失。因此,论文指出高效统一建模的关键并非孤立优化各任务效率,而是必须保留跨任务共享的结构以维持协同效应,强调未来需发展具备协同感知能力的加速策略。
链接: https://arxiv.org/abs/2606.01503
作者: Siyi Chen,Weiming Zhuang,Jingtao Li,Lingjuan Lv
机构: University of Michigan (密歇根大学); Sony AI (索尼人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Unified vision-language models (VLMs) integrate visual understanding and visual generation within a single autoregressive backbone, but their joint training is computationally expensive and largely overlooked from an efficiency perspective. In this work, we study the feasibility and limits of token-reduction-based acceleration for unified VLM training. Through a systematic analysis of layerwise attention allocation, we uncover a fundamental asymmetry: visual understanding exhibits substantial late-layer visual redundancy, whereas visual generation maintains persistent dependence on image tokens across depth. Guided by this observation, we design task-specific accelerators that selectively reduce image-token computation for each objective. While these methods achieve significant efficiency gains in isolated settings, we observe a consistent synergy loss under unified training – task-specific token dropping necessitates divergent parameter pathways and eliminates the mutual performance gains typically observed in joint optimization. Our findings suggest that efficient unified modeling requires preserving shared cross-task structures, highlighting the need for synergy-aware acceleration strategies. Project page: this https URL.
[NLP-101] meSage-MT: A Multi-Turn Benchmark for Evaluating Agent ic Time Series Reasoning
【速读】: 该论文旨在解决当前大型语言模型(LLM)代理在多轮对话中进行可靠时间序列分析的能力不足问题,尤其针对现实场景下用户目标动态演变、分析需基于历史结论累积且决策依赖于逐步证据推演的复杂工作流。现有基准普遍局限于单步任务(如预测或异常检测),无法评估代理在持续交互中维持记忆、处理不确定性及实现领域适配性决策的能力。其解决方案的关键在于提出TimeSage-MT——一个面向代理式时间序列推理的多轮对话基准,涵盖240个任务与2,680轮对话,覆盖8个真实世界领域,从基础探索延伸至决策导向分析;该基准通过可复现的流水线将真实时间序列数据转化为带可验证答案的多轮对话,并提供统一评估协议与公开排行榜。实验表明,前沿大模型在决策类任务上性能显著下降,主要源于记忆保持、不确定性管理及领域驱动决策能力的缺失,凸显当前代理推理中的关键短板,为未来系统研发提供了严谨的评测基础。
链接: https://arxiv.org/abs/2606.01498
作者: Yaxuan Kong,Qingren Yao,Yuqi Nie,Yichen Li,Yilei Shao,Stefan Zohren,Anna Vettoruzzo,Joaquin Vanschoren,Ming Jin,Qingsong Wen
机构: University of Oxford (牛津大学); VulpiVox Intelligence (VulpiVox智能); Eindhoven University of Technology (埃因霍温理工大学); Griffith University (格里菲斯大学); Squirrel Ai Learning (松鼠AI学习); East China Normal University (华东师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series data inform critical decisions across many real-world domains. While large language model (LLM) agents can analyze data through natural language and tools, it remains unclear whether they can conduct reliable time series analysis across multi-turn conversations. Existing benchmarks focus on single-step tasks such as forecasting and anomaly detection, overlooking practical workflows where user goals evolve, agents must build on prior analyses, and conclusions emerge from accumulated evidence. In this work, we introduce TimeSage-MT, a multi-turn benchmark for agentic time series reasoning with 240 tasks and 2,680 dialogue turns across 8 real-world domains, spanning basic exploration to decision-oriented analysis. TimeSage-MT is built through a reproducible pipeline that converts real-world time series data into multi-turn conversations with verifiable answers. It provides a unified evaluation protocol and public leaderboard for comparing time series agentic systems. To demonstrate the benchmark’s utility, we evaluate frontier LLMs alongside TimeSage, a novel structured agent equipped with a comprehensive time series skill library. The results show sharp performance drops on decision-oriented tasks, driven by failures in memory, uncertainty handling, and domain-based decision making. TimeSage-MT exposes critical gaps in current agentic reasoning and provides a rigorous foundation for future development.
[NLP-102] CART: Context-Anchored Recurrent Transformer – A Parameter-Efficient Architecture with Learned Stability
【速读】: 该论文旨在解决大模型参数效率与长序列建模能力之间的权衡问题,即在保持模型性能的同时降低参数量并提升推理效率。其核心挑战在于如何在不显著增加参数量的前提下,实现对长上下文的有效建模,同时避免传统循环结构中因重复计算键值(K/V)而导致的冗余与不稳定。解决方案的关键是提出一种名为CART(Context-Anchored Recurrent Transformer)的新型架构:通过引入一个共享的核心块(core block),在深度方向上重复使用R次,仅需一次从多层预置阶段(prelude)生成固定的键值张量,并通过多头潜在注意力(multi-head latent attention)使递归核心与这些冻结的上下文锚点进行交互;同时,采用可学习的线性时不变(LTI)门控机制控制递归稳定性,其谱半径(spectral radius)在所有36个训练配置中均收敛于[0.79, 0.83]的狭窄区间,确保了训练过程中的数值稳定。尽管该设计在参数效率方面表现出色,但在参数匹配条件下,其性能仍落后于密集基线模型约1-2%(存储参数等价),甚至在有效参数等价下差距扩大至约10%,诊断性消融实验表明该差距主要源于权重共享(~5%)和非均匀结构框架(预置/锚点/核心/尾部)带来的异质性(~5%),而递归核心内部机制(超连接、LTI门、循环索引嵌入)本身贡献微弱,属可忽略的附带效应。此外,可变循环次数的推理测试显示,性能在训练最优的R值两侧均下降,表明该方法在测试时动态扩展深度方面存在局限。
链接: https://arxiv.org/abs/2606.01495
作者: Chad A. Capps
机构: Independent Researcher
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 31 pages, 4 figures. Code, training scripts, and the full experiment database ( this http URL ) are available at this https URL
Abstract:We present CART (Context-Anchored Recurrent Transformer), a parameter-efficient language model that reuses a single shared core block R times across depth. Unlike prior looped transformers that recompute key-value tensors at every iteration, CART computes K and V once from a multi-layer prelude and has the recurrent core cross-attend to those frozen tensors via multi-head latent attention. A learned Linear Time-Invariant (LTI) gate keeps the recurrence stable: its spectral radius settles in a narrow band (rho in [0.79, 0.83]) across all 36 fully-trained configurations. We evaluate CART on single consumer GPUs in two stages: a 64-configuration screen at 3,000 steps, then 36 configurations (P=6, R in 6,8,10, three seeds) trained for 30,500 steps (~1B tokens). Two patterns hold across widths d in 256,512,768,1024: prelude depth P dominates loop count R, and the Stage-1 ranking of R reverses at full training (R=6 becomes best at d=512). At the binding d=1024 parameter-parity test, CART does not beat a parameter-matched dense baseline, losing by 1-2% at stored-parameter parity and by ~10% at effective-parameter parity. Diagnostic ablations split the effective-parameter gap into ~5% from weight sharing and a residual ~5% from the heterogeneous prelude/anchor/core/coda framing; the recurrent-core machinery (hyper-connections, LTI gate, loop-index embedding) is individually vestigial. Variable-R inference degrades on both sides of the trained R, a negative result for test-time depth scaling under this recipe. Comments: 31 pages, 4 figures. Code, training scripts, and the full experiment database (this http URL) are available at this https URL Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) Cite as: arXiv:2606.01495 [cs.LG] (or arXiv:2606.01495v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.01495 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-103] Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG
【速读】: 该论文旨在解决检索增强生成(RAG)系统中事实性(factuality)与可解释性(interpretability)不足的核心问题。其解决方案的关键在于提出对比证据理由注意力(CERA)框架,首次引入基于主观性的硬负样本选择机制,并通过辅助的注意力对齐损失(attention alignment loss)在对比学习中注入证据性归纳偏置(evidential inductive bias)。CERA通过双重训练目标——基于三元组的对比学习与可解释的注意力对齐——对密集检索器进行微调,其中后者利用人类标注的事实性理由(factual rationales)构建词性加权掩码分布,监督CLS到令牌的注意力分配。实验结果表明,基于主观性的硬负样本选择显著优于Contriever及传统硬负样本基线;同时,理由对齐在保持竞争性检索性能的前提下提升了模型输出的忠实性,验证了在人类理由引导下注意力可作为更可信的模型行为解释。CERA突破了传统以主题相似性为导向的检索范式,使检索器能够精准识别支撑证据的具体标记(token),从而推动RAG系统中证据选择的可解释性提升。
链接: https://arxiv.org/abs/2606.01482
作者: Francielle Vargas,João Robiatti,Diego Alves,Lucas Pascotti Valem,Maximilian Seeth,Sebastián Ferrada,Ameeta Agrawal,Daniel Pedronette,André Freitas
机构: University of Chile (智利大学); São Paulo State University (圣保罗州立大学); Saarland University (萨尔兰大学); University of São Paulo (圣保罗大学); University of Munich (慕尼黑大学); Portland State University (波特兰州立大学); Idiap Research Institute (Idiap研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Ensuring factuality and interpretability in RAG remains an open and urgent problem. We introduce Contrastive Evidence Rationale Attention (CERA), the first retrieval framework to employ subjectivity-based hard negative selection and inject an evidential inductive bias into contrastive learning through an auxiliary attention alignment loss. CERA fine-tunes a dense retriever using two training objectives: triplet-based contrastive learning and interpretable attention alignment, which supervises CLS-to-token attention using a part-of-speech-weighted masking distribution over human-annotated factual rationales as evidence signals. Experiments on a large corpus of clinical trial reports demonstrate that the subjectivity-based hard negative selection substantially improves retrieval effectiveness compared to both Contriever and hard negative selection baselines. Furthermore, rationale alignment improves faithfulness while maintaining competitive retrieval performance, supporting the hypothesis that attention can serve as a more faithful explanation of model behavior when guided by human rationales. Moving beyond topical similarity, CERA enables the retriever to identify the specific tokens that constitute supporting evidence, promoting more interpretable evidence selection in RAG systems.
[NLP-104] Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech ICML2026
【速读】: 该论文旨在解决生成式语音合成(Text-to-Speech, TTS)系统中情感控制的可解释性问题。尽管大语言模型(Large Language Models, LLMs)的引入提升了语音表达的丰富性,但现有方法主要依赖外部条件输入或全局激活调节(global activation steering),难以深入揭示情感控制背后的内部表征机制。本文的关键解决方案是利用稀疏自编码器(Sparse Autoencoders, SAEs)分析基于LLM的TTS模型语义隐藏状态中的情感相关变化,识别出分布于多个稀疏潜在特征(sparse latent features)中的情感信息。研究发现,仅干预一小部分关键潜在特征即可实现可解释的情感调控。基于此,提出一种无需修改主干参数的特征级干预框架,支持双向情感诱导与抑制。进一步实验表明,不同潜在特征与特定声学属性(如音高)相关联,说明情感表达是由多个潜在特征协同作用的结果,而非单一全局偏移。实证结果表明,通过调控这些稀疏潜在特征,在情感诱导与抑制性能上优于全局调节及现有TTS基线方法。
链接: https://arxiv.org/abs/2606.01479
作者: Hongfei Du,Jiacheng Shi,Sidi Lu,Gang Zhou,Ye Gao
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by ICML 2026
Abstract:Integrating large language models (LLMs) into text-to-speech (TTS) systems has improved speech expressiveness, yet interpretable emotional control remains challenging. Existing approaches primarily rely on external conditioning or global activation steering, offering limited insight into the internal representations underlying emotional control. In this work, we analyze emotion-related variation in the semantic hidden states of LLM-based TTS models using sparse autoencoders (SAEs) to identify sparse latent features. Our analysis shows that emotional variation is distributed across multiple sparse latent features, while intervening on a small subset enables interpretable emotion control. Building on this observation, we introduce a feature-level intervention framework for bidirectional emotion induction and suppression without modifying backbone parameters. We further show that distinct latent features are associated with specific acoustic attributes (e.g., pitch), suggesting that emotional expression arises from coordinated latent contributions rather than a single global shift. Empirically, steering these sparse latent features achieves comparable or superior emotion induction and suppression performance relative to global steering and existing TTS baselines.
[NLP-105] OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
【速读】: 该论文旨在解决生成式 AI(Generative AI)中在策略蒸馏(On-Policy Distillation, OPD)框架下存在的两个关键问题:一是标准OPD依赖教师模型的逐标记(token-level)logits,限制了无法提供内部logits的闭源强模型作为教师的应用;二是逐标记logits信号本身具有脆弱性,对教师与学生间下一标记的语义重叠敏感,易放大重复循环等退化模式。为应对上述挑战,本文提出OmniOPD框架,其核心创新在于采用无logits、基于块级(chunk-level)的监督信号,通过蒙特卡洛滚动(Monte Carlo rollouts)结合连续语义相似度度量来近似教师的局部偏好,并利用峰值熵调度器仅在学生推理不确定性高的分支处施加监督,从而提升学习信号的鲁棒性。此外,引入狄利克雷-多项式贝叶斯先验与基础模型KL锚点,有效控制离散采样的方差并防止未审计标记处的策略坍缩。实验表明,OmniOPD在数学类基准上相较标准OPD最高提升28.64%,且当与Claude-4.5-Haiku和Gemini-2.5-Flash等黑盒强教师结合时,相对开放权重教师进一步提升9.54%,显著超越自探索强化学习方法,验证了块级语义验证相比高密度但脆弱的逐标记匹配能提取更可靠的学习信号。
链接: https://arxiv.org/abs/2606.01476
作者: Yuhang Zhou,Lizhu Zhang,Yifan Wu,Mingyi Wang,Peng Bo,Jiayi Liu,Xiangjun Fan,Zhuokai Zhao
机构: Meta AI(元宇宙人工智能实验室)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 26 pages, 3 figures
Abstract:On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher’s token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher’s local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
[NLP-106] Peacemaker at ATE-IT: Automatic term extraction from Italian text for waste management data using encoder model
【速读】: 该论文旨在解决自动术语提取(Automatic Term Extraction, ATE)在实际应用中面临的两大核心问题:一是标注文档数量有限导致模型训练困难,二是跨领域场景下多词表达(multi-word expressions)提取的复杂性增加。针对上述挑战,本文提出了一种低成本且可解释性强的自动术语提取方法,专为ATE共享任务中的Task A设计。其解决方案的关键在于采用经过微调的提取策略,能够在计算资源受限的条件下高效运行,同时兼顾提取结果的准确性和可解释性。实验结果表明,该方法在类型级(type-level)和微观级(micro-level)的精确率、召回率与F1分数上均表现出一致且均衡的性能,优于多数参赛团队。尽管方法本身相对简单,但其良好的基础性能为低资源环境下的模型发展提供了有效起点,并为未来在保持可解释性前提下实现更高性能的模型扩展奠定了可能性。
链接: https://arxiv.org/abs/2606.01469
作者: Mahdi Bakhtiyarzadeh,Hadi Bayrami Asl Tekanlou,Jafar Razmara
机构: University of Tabriz (Tabriz大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195
Abstract:The development of automatic term extraction has become increasingly important in modern technology. Automatic term extraction can be found in virtually every search engine that is currently available to users. Recent advancements have provided promising results for the extraction of automatic terms; however, accurate labeling is difficult because of several factors, such as the limited number of annotated documents available for training and the complexity of extracting multi-word expressions due to shifts in the domain. In this paper, we will present a low-cost and interpretable method of automatic term extraction, developed specifically for Task A of the ATE Shared Task. This new method utilizes fine-tuning extraction strategies that can run on a small amount of computational resources. We evaluated our automated system using both type-level and micro-level measures of precision, recall, and F1-score to measure both complementary aspects of the extraction performance. According to the experimental results, our proposed approach achieves consistent and balanced performance compared to other teams. Even though the technique itself is relatively straightforward, it serves as a good starting point for low-resource models. Overall, the findings point toward the possibility of significant future advancements (in model expansion) with higher-level performance still able to retain their ability to be interpreted.
[NLP-107] Cross-lingual Self-Consistency for Multilingual Reasoning with Language Models
【速读】: 该论文旨在解决大语言模型(LLM)在多语言场景下推理能力严重受限于高资源语言(如英语)的问题,尤其针对低资源语言缺乏高质量标注数据和跨语言泛化能力弱的挑战。其核心解决方案是提出一种无监督强化学习(unsupervised Reinforcement Learning, RL)框架,通过强制跨语言自洽性(cross-lingual self-consistency)来提升多语言推理能力——即要求模型在不同语言表达的等价问题上产生一致的最终答案。该方法不依赖黄金答案或平行语料,仅利用模型自身生成的一致性信号进行优化,在MGSM基准上实现了跨10种语言平均21.7%的性能提升,并在未见语言上实现18.2%的平均增益,以及在3个分布外评测集上最高达6.2%的改进,验证了基于一致性约束的方法在无需监督数据条件下有效扩展大模型多语言推理能力的潜力。
链接: https://arxiv.org/abs/2606.01464
作者: Ahmed Elhady,Eneko Agirre,Mikel Artetxe
机构: HiTZ Center, University of the Basque Country (UPV/EHU); Reka AI
类目: Computation and Language (cs.CL)
备注: Paper under review
Abstract:Despite expanding their multilingual coverage, the advanced reasoning capabilities of LLMs remain largely confined to a few high-resource languages like English. To address this, we propose an unsupervised Reinforcement Learning (RL) approach to enhance multilingual reasoning by enforcing cross-lingual self-consistency: the principle that a model should produce the same final answer for equivalent problems in different languages. Existing methods are limited by the scarcity of multilingual reasoning data and show weak generalization to unseen languages. Our approach requires neither gold answers nor parallel data, and it achieves average gains of up to 21.7% on MGSM across 10 languages. In addition, our method demonstrates strong generalization, with an 18.2% mean improvement on MGSM languages unseen during training, and up to 6.2% gain on 3 out-of-distribution benchmarks. These results show the potential of consistency-based methods to improve the multilingual capabilities of LLMs without requiring supervised data.
[NLP-108] An Enigma of Artificial Reason : Investigating the Production-Evaluation Gap in Large Reasoning Models
【速读】: 该论文旨在解决大模型在推理评估能力上的显著短板问题,即尽管生成式推理模型(Generative Reasoning Models, GRMs)在生成复杂问题的长链推理过程方面表现卓越,但在对已有推理链条进行有效性评估时却存在严重缺陷。其核心问题是:当前主流的推理训练范式使模型更擅长“生成”推理路径以达成正确答案,而非“严谨评估”推理过程的逻辑合理性。解决方案的关键在于通过构建专门设计的 Valid-Answer-Invalid-Reasoning (VAIR) 数据集,将推理评估与推理生成任务解耦,从而精准识别模型在评估环节的系统性偏差。研究发现,前沿模型在评估具有有效答案但存在细微推理错误的解题方案时,准确率低至48%,远低于其接近100%的推理生成能力,揭示出明显的“生成-评估差距”。进一步的链式思维(Chain-of-Thought, CoT)分析和线性探测表明,模型普遍存在“答案确认偏差”——即优先验证最终答案是否正确,而非逐步审查推理逻辑,甚至在察觉异常推理后仍会编造合理化解释。因果修补实验进一步证实,模型的判断高度依赖于最终答案的表征,一旦修改答案表示,模型的评估结论与内部激活状态即发生反转。这说明现有推理训练机制过度激励模型以正确答案为导向进行推理生成与自我确认,而忽视了对推理过程本质有效性的鲁棒评估。该研究揭示了当前主流推理训练方法的根本局限,并呼吁发展更强调逻辑可解释性与推理独立验证的新范式。
链接: https://arxiv.org/abs/2606.01462
作者: Mingzhong Sun,Teresa Yeo,Armando Solar-Lezama,Tan Zhi-Xuan
机构: NUS Department of Computer Science (新加坡国立大学计算机科学系); MIT EECS (麻省理工学院电子工程与计算机科学系); A*STAR (新加坡科技研究局); Singapore-MIT Alliance for Research and Technology (SMART) (新加坡-麻省理工学院研究与技术联盟)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)
Abstract:Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer’s representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models’ confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons. Comments: 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables) Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2606.01462 [cs.AI] (or arXiv:2606.01462v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.01462 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-109] ruthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在偏好不一致情境下作为顾问时的诚实性对齐问题,即当模型自身利益与用户利益相冲突时,其是否仍能保持信息传递的真实性。核心挑战在于评估模型在缺乏激励约束下的策略性沟通行为是否符合博弈论中“廉价言论”(cheap-talk)理论所预测的最优战略均衡。研究将经典的Crawford-Sobel廉价言论模型转化为一个预设基准,用于量化评估LLM在不同偏置水平下的信息揭示程度。其解决方案的关键在于构建一个系统性的实验框架:设置5种偏置水平、3种提示模板、固定低温度采样,并在每个组合下生成200个状态(共12,000次发送者调用),以精确测量模型输出的信息量与理论最优解之间的差距。实验结果表明,所有四款指令微调模型(GPT-4o、Claude Sonnet 4.5、Gemini 2.5 Flash-Lite、Llama-3.3-70B)均显著过度披露信息,其归一化互信息达0.78–0.94,远高于理论最优值0.18–0.53,且呈现线性夸大特征,而非理论预期的粗粒度单调分段结构。此外,收益最大化与诚实性提示框架对结果影响甚微,而解码器分析显示,仅当接收方正确解析发送方声明数值时,该现象可复现;若采用仅嵌入解码器,则模型输出被误读为近乎无意义的“胡言乱语”,说明信息理解依赖于显式数值处理能力。这揭示了当前主流模型在战略沟通中存在严重的非理性过度透明倾向,其根本原因可能源于训练目标与真实博弈策略之间的脱节。
链接: https://arxiv.org/abs/2606.01456
作者: Hamidreza Hasani Balyani,Seyed Pouyan Mousavi Davoudi,Alireza Amiri-Margavi,Amin Gholami Davodi,Arshia Gharagozlou
机构: Amazon Lab126, HW Tech Org.(亚马逊实验室126,硬件技术部门); University of Pittsburgh(匹兹堡大学); University of Minnesota Duluth(明尼苏达大学杜鲁斯分校)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注: 19 pages. Code and data: this https URL
Abstract:Large language models are increasingly deployed as advisors whose objective is not aligned with the user’s: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver’s action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in 0.01,0.04,0.08,0.12 the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender’s stated number: an embedding-only decoder mis-reads the same data as near-babbling.
[NLP-110] Before and After Temperature: A Distributional View of Creative LLM Generation
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)创造力的无参考评估(reference-free evaluation)问题,即在缺乏人工标注或标准答案的情况下,如何准确衡量生成文本的创造性。传统方法依赖困惑度(perplexity)、熵(entropy)和top-1置信度差距(top-1 margin)等指标,但其预测能力有限。本文提出的关键创新在于:在采样温度(sampling temperature)对模型词元分布进行重塑造的阶段——即在下一个词元被采样之前——捕捉这一过程中的分布动态变化。研究发现,仅基于该阶段单个词元级别的特征即可实现对创意性排名的高精度预测(与GPT-4o/Gemini-2.5-Pro平均评分的斯皮尔曼相关系数ρ=0.918,与三名人类评审者多数意见的ρ=0.870),显著优于现有四种主流无参考基线方法(最大|ρ|≈0.76),性能差距达+0.165(对比平均模型评分)和+0.110(对比人类多数评分)。机制分析表明,这种优势源于温度升高至1.5时所引发的“不连贯态”(incoherence regime)的显著分布特征:累积质量宽度(n₉₅(q))从约1扩展至约131个词元,且超过预温度下前90%合理词元集合的质量流失约13个百分点。这说明词元级分布重塑信号是判断创造力的核心线索,而温度为0.8与0.3之间的区分则需依赖序列级特征。
链接: https://arxiv.org/abs/2606.01451
作者: V. S. Raghu Parupudi,Harsha Ponnada,Aditi Kaushal,S. Shria Parupudi,Saiteja Dasari,Sahiti Bulusu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to NGEN-AI 2026
Abstract:Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature \emphreshapes the model’s token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at T \in \0.3, 0.8, 1.5\ , a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman \rho=0.918 against an averaged gpt-4o,/,gemini-2.5-pro judge ( n=500 ) and \rho=0.870 against a three-rater human-majority ranking ( n=150 ). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at |\rho|!\approx!0.76 on both ground truths: a gap of +0.165 on averaged-LLM and +0.110 on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at \rho=0.83 , above the inter-human ceiling of \rho=0.77 , so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at T=1.5 the cumulative-mass width n_95(q) inflates from \sim!1 to \sim!131 tokens and post-temperature mass leaks off the pre-temperature top- 90% plausible set by about 13 percentage points. The per-token aggregates do not separate T=0.8 from T=0.3 ; discriminating the two coherent regimes is left to sequence-level features.
[NLP-111] Self-Revising Discovery Systems for Science: A Categorical Framework for Agent ic Artificial Intelligence
【速读】: 该论文旨在解决科学发现中“范式重构”(representational regime revision)的机制问题,即如何在不依赖主观新颖性判断的前提下,系统性地实现从旧知识体系到新知识体系的可验证过渡。其核心挑战在于区分单纯的信息检索、搜索与真正的科学发现——后者要求对表征框架本身进行形式化更新,并确保旧有证据与操作的可追溯性与一致性。解决方案的关键在于引入范畴论(category theory)作为形式化工具:在固定范式 $ b $ 下,系统状态由范畴 $ S_b $ 上的余预层(copresheaf) $ I_t: S_b \to \text{Set} $ 描述,而本体论(provenance)则由该余预层的元素范畴 $ \int_{S_b} I_t $ 给出。固定范式的操作被建模为保持本体论的自函子(endofunctor),而真正的发现则被定义为一种经验证的范式跃迁 $ u: S_b \to S_b’ $,其中旧有实体通过左肯兰延拓(left Kan extension, $ \text{Lan}_u I_t $)进行运输,并与跃迁后的状态对比以识别超出函子性传输的残余内容,从而实现客观的发现检测。该框架在两个系统中得到实例化:在 Builder/Breaker 中,基于最小描述长度(Minimum Description Length)门控机制对蛋白质力学世界模型进行修订,获得一种由慢集体模式参与条件化的全模式弹性柔度定律;在 CategoryScienceClaw 中,将类型化技能、实体、开放需求、工作流变异、门控机制、应力测试与公共讨论整合为携带证明的知识-计算图,以纤维网络为例记录候选模型、被拒方案、AIC 门控、微扰测试及最终接受的各向异性刚度代理模型,其输入为各向同性纤维计数描述符。两案例共同表明,范畴论不仅可作为科学发现的数学语言,亦可作为自修正生成式 AI 发现系统的工程规范。
链接: https://arxiv.org/abs/2606.01444
作者: Fiona Y. Wang,Markus J. Buehler
机构: Massachusetts Institute of Technology (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci); Computation and Language (cs.CL); Machine Learning (cs.LG); Category Theory (math.CT)
备注:
Abstract:Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b - Set, and provenance is the category of elements \int_S_b I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b - S_b’: old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.
[NLP-112] Learning from Saturated Data: Signals Beyond Correctness for LLM Training
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在现有基准测试和训练数据集上面临性能提升瓶颈的问题,即当任务被“饱和”时,仅依赖二元正确性(binary correctness)标签已难以有效提升模型下游性能。其核心解决方案在于引入更细粒度的质量信号以替代传统的二元判断,具体包括:(1)基于模型自我评估的成对质量比较(pairwise LLM self-judgments),即模型自身对不同解法结果进行相对优劣判断;(2)基于词元级熵(token-level entropy)的不确定性度量,将生成过程中的局部不确定性作为解题质量的代理指标。研究将这两种质量信号整合进多种训练算法中,并在Qwen3-1.7B-Base模型上进行评估。实验表明,在简单的算术任务上,基于质量信号的训练可使模型性能相比基线模型提升高达18.6%,显著优于标准微调(SFT);而在更复杂的GSM8K任务上,增益较为有限且高度依赖具体质量信号的选择——例如,自评结果与外部强裁判一致性较差,甚至可能导致性能退化。因此,研究结论指出,尽管质量信号能从已饱和的问答任务中提取有用信息以改进基础模型,但将其应用于复杂任务时仍需精细校准与深入研究。
链接: https://arxiv.org/abs/2606.01436
作者: Hanno Hiss,Jasper Dekoninck,Martin Vechev
机构: 未知
类目: Computation and Language (cs.CL)
备注: 25 pages, 5 figures
Abstract:The growing capabilities of large language models (LLMs) have led to the saturation of many benchmarks and training datasets used to improve them. Motivated by this, we investigate whether questions solved with perfect empirical accuracy can nevertheless be used to improve downstream performance. To do so, we replace binary correctness with two sources of more fine-grained quality signals: (1) pairwise LLM self-judgments, in which the model evaluates the relative quality of its own solutions, and (2) token-level entropy, where token-level uncertainty is used as a proxy for solution quality. We incorporate these signals into several training algorithms and evaluate them on Qwen3-1.7B-Base. When training exclusively on a simple arithmetic task, quality-based signals improve performance by up to 18.6% over the base model, substantially outperforming SFT. On GSM8K, however, gains are more modest and depend strongly on the quality signal. For instance, self-judgments show poor agreement with a stronger external judge and can even degrade performance below the base model. Overall, our results suggest that quality-based training can extract useful signal from saturated questions for base models, but that applying such signals to more complex tasks requires careful calibration and further study.
[NLP-113] DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering
【速读】: 该论文旨在解决药物信息问答(drug-information question answering)中因生成式模型幻觉(hallucination)导致临床决策误导的问题,尤其关注答案所依据事实的可追溯性与来源权威性。其核心挑战在于如何确保回答不仅准确,而且每一条引用事实均能追溯至原始监管文件或同行评审文献。为此,研究提出DrugClaw——一种多智能体检索增强系统,通过基于反思驱动的状态机工作流,从药物与药物流行病学技能注册库中动态查询,并返回基于一级权威源的可信答案。该方案的关键在于引入“反射式状态机”机制,使系统能够自主评估和修正推理路径,同时结合权威性感知的基准测试DrugAudit进行严格评估。DrugAudit包含3,772个样本,采用双评委大语言模型(LLM-as-judge)协议,对上游来源匹配度、语义片段重叠率及引用忠实性进行评分,其评委间一致性kappa值达0.88(近乎完美)。在涵盖DrugAudit以及MedQA和PubMedQA中的药物相关子集的综合评测中,DrugClaw在各项指标上均取得最优表现,包括复合证据指数、答案正确率、一级来源占比(0.918,优于次优模型10.1个百分点)、引用忠实性(0.887,提升5.9个百分点),以及在MedQA(0.920)和PubMedQA(0.693)上的领先性能,验证了其在高可靠性药物问答场景中的有效性。
链接: https://arxiv.org/abs/2606.01434
作者: Qing Wang,Bo Li,Jialu Liang,Daling Shi,Bob Zhang,Qianqian Song
机构: University of Florida (佛罗里达大学); University of Macau (澳门大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).
[NLP-114] Consistent and Distinctive: LLM Benchmark Efficiency via Maximum Independent Set Prompt Selection on Similarity Graphs
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在综合性基准测试中评估成本高、耗时长的问题。其核心挑战在于如何在保证评估结果代表性的同时,显著减少所需的测试提示(prompt)数量。为此,论文提出一种基于图结构的提示选择框架:将每个基准测试建模为一个相似性图,其中节点代表提示,若其嵌入空间距离超过可配置阈值则连边,进而通过最大独立集(Maximum Independent Set, MIS)算法选取一个具有最大多样性且无冗余的提示子集。该方案的关键在于利用MIS求解器从高维语义空间中筛选出覆盖广泛能力维度的代表性提示,从而实现高效且稳定的模型性能评估。实验表明,在多种嵌入模型、距离度量和阈值设置下,不同随机种子下的重复采样均能保持高度一致的模型排名(Kendall’s W ≥ 0.90 在99.2%的配置中成立,平均达0.997±0.008),且在较高百分位阈值下平均可减少25%–48%的提示数量;仅15.95%的配置出现与全基准对比的排名偏差(ρ < 0.95),主要集中在低阈值(p₁₀–p₂₀)和特定基准(GPQA、IFEval)上,揭示了图结构过密是导致性能失真的主要失败模式。
链接: https://arxiv.org/abs/2606.01400
作者: Denica Kjorvezir,Marko Djukanović,Ana Gjorgjevikj,Gjorgjina Cenikj,Tome Eftimov
机构: Jožef Stefan Institute (乔泽夫·斯特凡研究所); Jožef Stefan International Postgraduate School (乔泽夫·斯特凡国际研究生院); University of Nova Gorica (新戈里察大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Evaluating large language models (LLMs) across comprehensive benchmarks is expensive and time-consuming. We propose a graph-based prompt selection framework that models each benchmark as a similarity graph – nodes are prompts connected if their embedding-space distance falls above a configurable threshold – and applies Maximum Independent Set (MIS) algorithms to select a maximally diverse, non-redundant subset. We evaluate four MIS solvers (CPLEX, GREEDY, Online-MIS, ReduMIS) across six embedding models, three distance measures, six percentile thresholds, and four benchmarks (GPQA, IFEval, MMLU-Pro, Omni-MATH) covering 66 LLMs. Our central hypothesis – that repeated selection under different random seeds yields consistent LLM rankings that may also differ from the full-benchmark baseline – is strongly confirmed: Kendall’s W \geq 0.90 in 99.2% of stochastic configurations (mean W = 0.997 \pm 0.008 ), while at higher percentile thresholds selected subsets achieve 25–48% prompt reduction on average. Ranking divergence from the full benchmark ( \rho 0.95 ) occurs in only 15.95% of configurations, concentrated at low thresholds ( p_10 – p_20 ) and benchmarks (GPQA, IFEval), identifying overly dense graphs as the primary failure mode.
[NLP-115] UniD3: A Knowledge Graph-Enhanced RAG Framework for Drug-Disease Discovery and Reasoning
【速读】: 该论文旨在解决生物医药文献中药物-疾病关系知识提取与整合所面临的异质性高、增长迅速以及现有数据集依赖人工标注且不完整的问题。其核心挑战在于如何在保证知识准确性的同时,高效处理海量非结构化文献,并避免仅依赖大语言模型(LLM)所带来的幻觉和证据支撑薄弱等问题。解决方案的关键是提出UniD³框架,该框架通过融合大语言模型与知识图谱增强的检索增强生成(KG-RAG)技术,实现对药物-疾病匹配(DDM)、药物有效性评估(DEA)及药物靶点分析(DTA)三类知识的系统性提取、组织与验证。该方法采用双阶段策略:首先基于论文层级进行实体与关系抽取,再通过以药物和疾病实体为中心的知识图谱层级整合,构建高质量知识图谱;进而利用KG-RAG生成结构化数据集,并通过外部基准测试、模糊匹配已知资源及临床医生评审进行多维度验证。实验结果表明,UniD³成功构建了包含28,915条DDM、15,042条DEA及超过4,000条DTA问答对的大规模数据集,各项任务的F1值达0.85–0.87(DDM/DEA)和0.82(DTA),临床评审显示其可靠性高(AUROC = 0.90)。相比独立使用的LLM,KG-RAG增强模型表现更优,且配套的UniD³聊天机器人支持可解释、带引用依据的药物-疾病关系探索。因此,该研究提供了一个可扩展、可泛化的框架,能够将非结构化生物医学文献转化为高质量、结构化的药物-疾病知识,为人工智能驱动的药物发现、老药新用及精准医疗提供坚实基础。
链接: https://arxiv.org/abs/2606.01394
作者: Qing Wang,Tianshi Liu,Minghao Zhou,Jialu Liang,Sen Guo,Guangyu Wang,Jing Su,Qianqian Song
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Systematic characterization of drug-disease relationships is essential for drug discovery and repurposing, yet is hindered by the heterogeneity and rapid growth of biomedical literature. Existing datasets rely on labor-intensive curation and are often incomplete, while LLM-only approaches suffer from hallucination and weak evidence grounding. We introduce UniD ^3 , a unified framework that integrates Large Language Models with Knowledge Graph-enhanced Retrieval-Augmented Generation (KG-RAG) to extract, organize, and validate drug-disease knowledge across Drug-Disease Matching (DDM), Drug Effectiveness Assessment (DEA), and Drug-Target Analysis (DTA). UniD ^3 processes 157,849 PubMed articles with Llama 3.3-70B and constructs knowledge graphs via a dual-stage strategy combining paper-level extraction with KG-level consolidation centered on drug and disease entities. These graphs support KG-RAG-based generation of structured datasets, evaluated through external benchmarks, fuzzy matching with curated resources, and clinician review. UniD ^3 produces six knowledge graphs and large-scale datasets, including 28,915 DDM, 15,042 DEA, and over 4,000 DTA QA pairs. External validation shows strong performance (F1: 0.85-0.87 for DDM/DEA; 0.82 for DTA), with clinician review confirming high reliability (AUROC = 0.90). KG-RAG-augmented models outperform standalone LLMs, and the UniD ^3 chatbot enables interpretable, citation-supported exploration of drug-disease relationships. UniD ^3 provides a scalable, extensible framework for transforming unstructured biomedical literature into high-quality, structured drug-disease knowledge, supporting AI-driven discovery, repurposing, and precision medicine.
[NLP-116] Dr. DocBench: A Comprehensive Benchmark for Expert-Level and Difficult Document Parsing
【速读】: 该论文旨在解决现有文档解析与识别基准在覆盖范围和难度上的局限性问题,特别是针对专家领域复杂结构(如化学式、乐谱、复杂表格及跨页布局)的标注不足与评估能力缺失。其核心解决方案是提出Dr. DocBench——一个面向专家级文档解析的难度感知型基准。该基准基于大规模多语言图书语料库构建,涵盖52个BISAC主题领域,并采用基于解析器失败的采样策略筛选出高难度文档,聚焦于多个先进系统均表现不佳的场景。其包含4,514页长文档(平均约100页/篇),并提供6.5万条高质量的页面级与块级标注,涵盖版面布局、阅读顺序、层级关系及领域特定视觉内容等维度。实验表明,现有基准上表现优异的流水线解析器与通用视觉-语言模型(VLMs)在本基准上出现显著性能下降,揭示了当前方法在复杂结构理解方面的系统性缺陷,验证了Dr. DocBench作为诊断与推进文档智能发展的综合性测试平台的有效性。
链接: https://arxiv.org/abs/2606.01393
作者: Minglai Yang,Xinyan Velocity Yu,Pengyuan Li,Xinyu Guo,Zhenting Qi,Konwoo Kim,Longtian Ye,Xiaolong Luo,Jinhe Bi,Henry Zhang,Haris Riaz,Xuan Zhang,Yunze Xiao,Bangya Liu,Tom Tang,Yunfei Zhao,Qunshu Lin,Zihan Wang,Minghao Liu,Michael Lingzhi Li,Yilun Du,Jesse Thomason,Rogerio Feris,Alex Pentland,Zexue He
机构: 2077AI; Stanford University (斯坦福大学); MIT (麻省理工学院); Carnegie Mellon University (卡内基梅隆大学); University of Southern California (南加州大学); Harvard University (哈佛大学); IBM Research (IBM研究院); University of Arizona (亚利桑那大学); Duke University (杜克大学); UC Berkeley (加州大学伯克利分校); LMU Munich (慕尼黑路德维希马克西米利安大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 13 figures, 14 tables
Abstract:Document parsing and recognition are fundamental capabilities for vision-language models (VLMs) and document processing systems. However, existing Optical Character Recognition (OCR) and document parsing benchmarks are increasingly limited in coverage and difficulty: many focus on common document genres or uniformly sampled pages where modern parsers already perform strongly, while offering limited annotation for expert-domain structures such as chemical formula, music notation, complex tables, and cross-page layouts. We introduce Dr. DocBench, a difficulty-aware benchmark for expert-level document parsing. Built from a large-scale multilingual book corpus, Dr. DocBench spans 52 BISAC subject domains and selects challenging documents through parser-failure-based sampling, targeting cases where multiple state-of-the-art systems struggle. It contains 4,514 annotated pages from long documents averaging around 100 pages, with 65k high-quality page- and block-level annotations for layout, reading order, hierarchical relations, and domain-specific visual contents. Evaluations of pipeline-based parsers and general-purpose VLMs show that strong performance on existing benchmarks does not transfer to our expert-level document parsing. Our analysis reveals substantial failures across subjects, content types, and structural attributes, highlighting Dr. DocBench as a comprehensive testbed for diagnosing and advancing document intelligence.
[NLP-117] GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning
【速读】: 该论文旨在解决公共部门在部署生成式 AI 服务时面临的隐私与数据合规难题,尤其是在意大利公共行政(PA)体系中,受限于监管和组织约束,无法将分散在各地的敏感内部数据(如工单记录、官员手册、数据库提取物等)集中化处理。为应对这一挑战,提出基于联邦学习(Federated Learning, FL)的隐私保护聊天机器人 GuidaPA,其核心解决方案在于:通过在客户端本地进行安全预处理、结合基于角色的访问控制机制,并采用参数高效的微调方法(如QLoRA 4-bit)在多轮联邦学习中对大语言模型进行优化,同时显式监控非独立同分布(non-IID)数据带来的影响。实验表明,在15轮联邦训练下,模型在多个评估指标上接近私有集中式微调的表现(如ROUGE-1/2/L分别为61.10/55.77/59.44,BLEU-4为45.02,METEOR为63.94),且无需中央汇聚数据,验证了联邦学习在保障数据隐私前提下实现高质量领域对话系统的技术可行性。
链接: https://arxiv.org/abs/2606.01386
作者: Daniel M. Jimenez-Gutierrez,Albenzio Cirillo,Raffaele Nicolussi,Alessio Beltrame,Andrea Vitaletti
机构: University of Bologna (博洛尼亚大学); University of Milan (米兰大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)
Abstract:We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that can not be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94-close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing
[NLP-118] FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting
【速读】: 该论文旨在解决长时序预测中轻量级模型在非平稳性条件下的建模瓶颈问题,具体表现为:传统可逆实例归一化(RevIN)仅依赖单一历史统计量进行全时域去归一化,难以应对非平稳数据;同时,时域趋势/季节性分解依赖固定且非自适应的滤波器,缺乏灵活性。其解决方案的关键在于提出FreqLite——一种超轻量、通道无关的频域分解线性预测器,通过可学习、无损且满足单位划分(partition-of-unity)的谱滤波器将输入信号分解为多个频带,各频带由独立的线性头进行预测,且保留高频分量而非像低通截断方法那样丢弃。此外,提出自适应可逆实例归一化(A-RevIN),该方法在非平稳场景下动态激活,严格推广了RevIN(当门控关闭时可精确恢复为原版),并在平稳数据上无损退化,显著提升对非平稳性的建模能力。实验表明,FreqLite在标准长时序预测基准上表现最优,在长回溯长度(L=336)下以4倍更少参数、2.2倍更低内存与计算开销,实现低于PatchTST Transformer的平均误差(0.3244 vs. 0.3587 MSE),且改进在配对威尔科克森检验中具有高度统计显著性(p < 1e-5)。所有组件均可独立消融验证,结果可在普通硬件上完全复现。
链接: https://arxiv.org/abs/2606.01339
作者: Mirza Samad Ahmed Baiga,Syeda Anshrah Gillani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注: 26 pages, 5 figures
Abstract:Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN’s benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.
[NLP-119] Benchmarking Local LLM s for Natural-Language-to-SQL Querying in Biopharmaceutical Manufacturing: An Empirical Benchmark on Consumer-Grade Hardware
【速读】: 该论文旨在解决在受监管的生物制药制造环境中,如何安全、合规地应用生成式人工智能(Generative AI)以支持自然语言到结构化查询(Natural Language to SQL, NLQ)任务的问题。由于美国食品药品监督管理局(FDA)指南、欧盟良好生产规范(GMP)以及欧盟人工智能法案等法规对数据隐私与系统可追溯性的严格要求,云部署的AI系统存在合规风险,因此亟需一种符合质量体系(GxP)要求的本地化解决方案。本文的关键解决方案是评估四款开源大语言模型(LLM)——Qwen 2.5 Coder 7B、Llama 3.1 8B、Mistral 7B 和 Meditron 7B——在本地通过Ollama部署时,针对制药制造数据库执行自然语言到SQL生成的能力。研究构建了一个基于FastAPI的评估平台PharmaBatchDB AI,使用包含约6.3万条记录的合成Microsoft SQL Server数据库进行测试,涵盖批次(Batch)、制造执行系统(MES)和在线清洗(CIP)模块。结果表明,代码优化的通用型模型(如Llama 3.1 8B和Qwen 2.5 Coder 7B)在结构化查询生成任务中表现优于专为生物医学领域设计的Meditron 7B,后者因上下文窗口限制和生成能力不足而几乎无法完成任务。尽管两者的性能差异不具统计显著性,但均显示当前本地部署的生成式AI系统虽可在消费级硬件上实现,仍需人工审核与下游验证才能满足监管环境下的使用要求。
链接: https://arxiv.org/abs/2606.01338
作者: Sagar Bhetwal,Rajan Bastakoti,Nirajan Acharya,Gaurav Kumar Gupta
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Biopharmaceutical manufacturing organizations operate under regulatory frameworks such as FDA guidance, EU Good Manufacturing Practice (GMP), and the EU AI Act, which can restrict the use of cloud-based artificial intelligence systems. Locally deployed large language models (LLMs) offer a privacy-preserving alternative, but their suitability for pharmaceutical manufacturing tasks remains underexplored. This study evaluates four open-source LLMs (Qwen 2.5 Coder 7B, Llama 3.1 8B, Mistral 7B, and Meditron 7B) deployed locally via Ollama for natural-language-to-SQL generation over a pharmaceutical manufacturing database. A FastAPI-based evaluation platform, PharmaBatchDB AI, was developed using a synthetic Microsoft SQL Server database containing approximately 63,000 records across Batch, Manufacturing Execution System (MES), and Clean-In-Place (CIP) modules. Models were benchmarked on 60 domain-specific natural-language questions using metrics including SQL extraction rate, SQL compliance, factual consistency, ROUGE-L, hallucination rate, throughput, and latency. Qwen 2.5 Coder 7B, Llama 3.1 8B, and Mistral 7B generated SQL for all evaluation tasks, while Meditron 7B failed on nearly all tasks due to context-window limitations and poor SQL generation capability. Llama 3.1 8B achieved the highest SQL compliance, whereas Qwen 2.5 Coder 7B achieved the strongest overall text similarity and factual consistency. Performance differences between the two leading models were not statistically significant. The results show that code-tuned general-purpose LLMs outperform a domain-specific biomedical model on structured query generation for pharmaceutical manufacturing data. Although fully local, GxP-aligned NLQ systems are feasible on consumer hardware, current performance levels still require human oversight and downstream validation for regulated use. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.01338 [cs.CL] (or arXiv:2606.01338v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.01338 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Gaurav Kumar Gupta [view email] [v1] Sun, 31 May 2026 16:41:26 UTC (34 KB)
[NLP-120] LongAttnComp: Cross-Family Context Compression for Long-Context Reasoning
【速读】: 该论文旨在解决大规模上下文(10万+ token)处理中上下文长度与推理效率之间的瓶颈问题,尤其针对代码推理等高要求长文本任务中现有无训练注意力压缩方法性能不足的缺陷。其解决方案的关键在于提出LongAttnComp,通过微调轻量级交叉注意力评分层,并引入基于令牌预算的top-p算法、分块策略、位置重排序以及格式无关的查询解析器,实现高效且精准的上下文压缩。此外,设计了两阶段微调方案:第一阶段利用NIAH风格数据构建通用检索基础,第二阶段引入多跳推理数据扩展长文本任务覆盖能力,显著提升了在InfiniteBench Code-Debug和LongBench v2上的表现,不仅达到甚至超越全上下文基准精度,且具备跨模型迁移能力。
链接: https://arxiv.org/abs/2606.01336
作者: Mengmeng Ji,Ravi Shanker Raju,Jonathan Lingjie Li,Chen Wu
机构: SambaNova Systems, Inc.
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:As real-world applications increasingly require processing inputs of 100k+ tokens, the gap between context length and inference efficiency has become a critical bottleneck. Context compression offers a way to reduce prefill costs while preserving task accuracy. However, existing training-free attention-based methods leave substantial gaps in demanding long-context tasks such as code reasoning. We present LongAttnComp, a long-context adaptation of AttnComp that fine-tunes a lightweight cross-attention scoring layer and introduces tokenlevel chunking, a token-budget top-p algorithm, positional reordering, and a formatagnostic query parser. We further design a two-stage fine-tuning recipe for the compressor: Stage 1 builds a general retrieval foundation from NIAH-style data, and Stage 2 extends it with multi-hop and reasoning data for broader long-context task coverage. On InfiniteBench Code-Debug, LongAttnComp matches or exceeds full-context accuracy, substantially outperforms training-free baselines, and transfers across four target models from three families. On LongBench v2, the two-stage recipe largely closes the Stage 1 gap on multi-document reasoning while preserving Code-Debug performance.
[NLP-121] DiffuSent: Towards a Unified Diffusion Framework for Aspect-Based Sentiment Analysis
【速读】: 该论文旨在解决生成式情感分析(Generative AI)在细粒度方面情感分析(Aspect-Based Sentiment Analysis, ABSA)任务中,因采用自回归逐标记生成方式导致的边界敏感性不足问题,尤其是在多词方面项与观点项的识别场景下表现不佳。其核心解决方案是提出一种非自回归的扩散框架DiffuSent,将所有ABSA子任务统一建模为边界去噪扩散过程,通过逐步从噪声状态中重构精确的边界信息,实现对多词实体边界的精准捕捉。同时引入对比去噪训练策略,有效抑制扩散过程中由噪声扰动引发的重复预测问题。实验结果表明,DiffuSent在28个不同设置(7个子任务×4个数据集)下均显著优于现有最强的生成式与基于跨度的方法,在多词三元组识别上平均提升2.48 F1,且在包含多个情感三元组的复杂句子中保持稳定的提取精度;此外,非自回归解码机制带来高达181倍的推理速度提升,显著改善了效率瓶颈。
链接: https://arxiv.org/abs/2606.01323
作者: Shu Long,Yanglei Gan,Xuchuan Zhou
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Aspect-Based Sentiment Analysis (ABSA) encompasses seven distinct subtasks, each focusing on different extracted elements. Despite the proven success of generative models in unified aspect sentiment analysis, existing approaches often rely on auto-regressive token-by-token generation without grasping the whole information of the aspect and opinion terms, resulting in boundary insensitivity, particularly in context of multi-word aspect and opinion terms. To address these issues, we present DiffuSent, a non-auto-regressive diffusion framework that systematically formulates all ABSA subtasks as boundary denoising diffusion processes, progressively refining boundaries over noisy states. Furthermore, we introduce a contrastive denoising training strategy which effectively address duplicate predictions with subtle variations introduced by diffusion process. Extensive experiments across 28 settings (7 subtasks x 4 datasets) demonstrate that DiffuSent achieves delivers consistent improvements over the strongest generative and span-based systems. DiffuSent exhibits notable gains on multi-word triplets, achieving an average improvement of +2.48 F1, and maintains robust extraction accuracy in sentences containing multiple sentiment triplets. Moreover, the non-auto-regressive decoding enables substantial efficiency benefits, reaching up to 181 times faster inference than auto-regressive generative baselines
[NLP-122] ukaBench: A Culturally Grounded Jailbreak Benchmark for African Languages
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)安全评估过度依赖英语(English-centric)的问题,尤其关注低资源语言(Low-Resource Languages, LRLs),特别是非洲语言在安全评测中的严重缺失。其核心挑战在于现有安全评估基准缺乏对非英语语境下模型行为的充分覆盖,导致对非洲语言中模型安全性与鲁棒性的认知存在显著空白。为应对这一问题,研究提出TUKABENCH——一个面向七种非洲语言的越狱攻击(jailbreak)评测基准,通过四种差异化设置扩展了原有JailbreakBench(JBB):(1)基于人工翻译的原始提示迁移;(2)将英文提示适配至非洲文化语境后的人工翻译;(3)由人工精心设计并经与GPT-5.2交互验证的提示;(4)融合英语与非洲语言的代码切换(code-switched)提示,以分离语言、文化嵌入性与提示规避性对模型安全的影响。关键发现表明,在非洲语言中提示时,模型拒绝率低于英语,而文化适配提示带来的拒绝率最低,揭示文化语境对模型响应行为的重要影响。此外,研究揭示两大结构性缺陷:模型理解能力不足(model comprehension failures)以及“模型作为裁判”(LLM-as-a-judge)在低资源语言中的可靠性下降。为此,研究引入“偏移”(Deflection)作为新的评估维度,以补充“拒绝”(Refused)与“越狱成功”(Jailbroken);同时通过人工标注验证输出,证实人类与模型判断的一致性在低资源语言及不常用书写系统中显著降低。因此,解决方案的关键在于构建多维度、跨语言、文化敏感的评估框架,并引入人工验证机制与新型评估指标,以更真实地反映低资源语言中模型的安全表现。
链接: https://arxiv.org/abs/2606.01322
作者: Victor Akinode,Senyu Li,Wassim Hamidouche,Waqas Zamir,Inbal Becker-Reshef,David Ifeoluwa Adelani
机构: Mila - Quebec AI Institute(蒙特利尔魁北克人工智能研究所); McGill University(麦吉尔大学); Microsoft AI for Good Research Lab(微软人工智能向善研究实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Safety evaluation of Large Language Models (LLMs) remains heavily English-centric, leaving Low-Resource Languages (LRLs), particularly African ones, critically underexplored. We introduce TUKABENCH, a jailbreak benchmark for seven African languages that extends JailbreakBench (JBB) beyond direct translation through four settings: human translation of JBB prompts, English adaptation to African contexts followed by human translation, human-curated prompts validated through interactions with GPT-5.2, and code-switched prompts combining English and African languages, isolating the effect of language, cultural grounding, and prompt evasiveness on model safety. Across closed and open models, prompting in African languages reduces refusal relative to English, with culturally adapted prompts leading to least refusal. The evaluation also surfaces two structural limitations: model comprehension failures and reduced LLM-as-a-judge reliability in LRLs. To capture the first, we introduce Deflection alongside Refused and Jailbroken; to assess the second, we validate outputs with human annotations, showing that judge-human agreement drops in lower-resource languages and less commonly supported scripts.
[NLP-123] Med-HEAL: Analyzing and Mitigating Hallucinations in Medical LLM s with Hallucination-Aware In-Context Learning
【速读】: 该论文旨在解决医疗领域大语言模型(LLM)在处理复杂电子健康记录(EHR)时产生的幻觉问题,此类幻觉可能对临床决策支持系统构成严重风险。现有评估基准普遍缺乏真实的临床背景,难以有效揭示幻觉的成因及实际缓解路径。为此,作者提出Med-HEAL框架,通过基于MIMIC-IV出院记录构建的EHRNoteQA基准,系统地识别、分析并缓解医疗LLM中的幻觉现象。其关键在于构建了一个高质量的幻觉数据集:利用BioMistral-7B模型在开放式临床问答任务中生成回答,并通过“大模型作为裁判”(GPT-4o)与医学专业学生双轨审核相结合的方式进行标注,实现对答案正确性及推理错误类型的精准标注。在此基础上,论文验证了两种实用的缓解策略——自省式批判(self-critique)与检索增强的上下文学习(RA-ICL),实验表明自省式批判在无需参数更新的情况下显著提升了五种开源模型中三种的准确性(p < 0.05)。Med-HEAL不仅提供可复用的幻觉数据集,更建立了一套兼具临床真实性和可操作性的研究与治理框架,为医疗LLM在临床环境中的安全部署提供了有力支撑。
链接: https://arxiv.org/abs/2606.01301
作者: Yiming Liao,Zeno Franco,Jose Eduardo Lizarraga Mazaba,Keke Chen
机构: University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校); Concordia University Wisconsin(协和大学威斯康星分校); Medical College of Wisconsin(密尔沃基医学院)
类目: Computation and Language (cs.CL)
备注: 12 pages, 5 figures. Preprint full version of an accepted ACM-BCB 2026 short paper
Abstract:Hallucinations in medical large language models (LLMs) pose serious risks for clinical decision support, particularly when models must reason over complex electronic health records (EHRs). However, existing benchmarks often lack a realistic clinical context and provide limited insight into how hallucinations can be mitigated in practice. We introduce Med-HEAL, a framework for systematically identifying, analyzing, and mitigating hallucinations in medical LLMs using clinically grounded data. Building on the EHRNoteQA benchmark derived from MIMIC-IV discharge summaries, we construct a hallucination dataset by evaluating BioMistral-7B on open-ended clinical question answering tasks. Model outputs are labeled through a dual evaluation pipeline that combines LLM-as-a-Judge assessment (GPT-4o) with human auditing by medical student reviewers, producing correctness judgments and annotations of reasoning errors via a custom web-based evaluation system. We then leverage this dataset to investigate mitigation strategies: a self-critique pipeline, in which the test model reviews its own answers to detect potential errors and regenerates responses for flagged cases, and retrieval-augmented in-context learning (RA-ICL), which exposes the model to hallucinated and corrected examples. Experiments across five open-source LLMs-BioMistral, Llama-3.1, DeepSeek, Qwen2.5, and Qwen3, show that the self-critique strategy improves accuracy for three of five models (p 0.05) without requiring parameter updates. Med-HEAL provides both a reusable hallucination dataset and a practical framework for studying and mitigating hallucinations in medical LLMs, supporting safer deployment of AI systems in clinical environments. Our code and data are publicly available at this https URL. Comments: 12 pages, 5 figures. Preprint full version of an accepted ACM-BCB 2026 short paper Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.01301 [cs.CL] (or arXiv:2606.01301v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.01301 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-124] Challenger at MultiPRIDE: Is It Hate Speech or Reclaimed?
【速读】: 该论文旨在解决在社交媒体平台中自动识别仇恨言论(hate speech)时,难以区分真实仇恨言论与被重新定义(reclaimed language)的语言表达这一关键问题。由于被重新定义语言具有高度语境依赖性和语义复杂性,导致标注困难且易产生误判。其解决方案的关键在于提出一种简单且可解释的方法:首先生成密集的语义文本嵌入(dense semantic text embeddings),随后利用Cleanlab结合逻辑回归进行标签噪声过滤,最后通过多层感知机(MLP)神经网络完成分类。该方法在计算资源受限的条件下仍保持优异性能,并在极端类别不平衡的数据集上展现出鲁棒性,同时兼顾模型的可解释性,为未来通过更大规模嵌入模型和更先进预处理技术提升效果提供了可行路径。
链接: https://arxiv.org/abs/2606.01298
作者: Hadi Bayrami Asl Tekanlou,Mahdi Bakhtiyarzadeh,Jafar Razmara
机构: University of Tabriz (Tabriz大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 2 figures, Published in EVALITA 2026, CEUR Workshop Proceedings Vol. 4195
Abstract:The spread of hate speech has become increasingly harmful in modern digital environments, particularly on social networking platforms. While recent advances have shown promising results in automatic hate speech detection, a key challenge remains: distinguishing genuine hate speech from reclaimed language. Accurate labeling is difficult due to the nuanced and context-dependent nature of reclaimed expressions. In this paper, we present a simple and interpretable approach for distinguishing hate speech from reclaimed language, developed for the MultiPride Shared Task. Our method generates dense semantic text embeddings and incorporates a label-noise filtering stage using Cleanlab with logistic regression, followed by a Multi-layer Perceptron (MLP) neural network for final classification. The system is designed to operate under limited computational resources while maintaining strong performance. We evaluate our approach using precision, recall, and F1-score, including macro-averaged metrics. Experimental results demonstrate robust performance despite extreme class imbalance in the dataset. Overall, the findings highlight the potential for further improvements through larger embedding models and more advanced preprocessing techniques while preserving interpretability.
[NLP-125] Dont Read Everything: A Curvature-Conditioned Query for Linear Attention
【速读】: 该论文旨在解决线性注意力(Linear Attention)在上下文内检索(in-context retrieval)和长序列任务中表现不佳的问题。其核心挑战在于:尽管线性注意力通过递归快速权重状态将计算复杂度从二次降低为线性,但其读取阶段仍采用全量历史键向量的加性叠加机制,导致有效信息被大量冗余存储向量稀释。为此,论文提出一种仅作用于读取阶段的轻量级改进机制——曲率条件查询(Curvature-Conditioned Query, CCQ)。CCQ的关键创新在于利用软最大函数(softmax)在各向同性注意力点处的二阶泰勒展开,构建一个局部二次模型,其曲率与运行中的键向量协方差一致,该协方差可通过与线性注意力状态相同的递归/分块机制高效维护。由此得到的线性算子在查询读取前沿记忆中高密度方向对查询进行收缩,从而增强对关键信息的聚焦能力。该方法仅修改读取过程,可与任意线性注意力主干网络(如GLA、Gated DeltaNet)无缝组合,在几乎无额外计算开销的前提下,显著提升困惑度、零样本下游任务准确率、长上下文检索性能(S-NIAH)、长度外推能力(4K至20K)以及LongBench基准表现。
链接: https://arxiv.org/abs/2606.01294
作者: Dong Le,Thong Nguyen,Cong-Duy Nguyen,Anh Tuan Luu
机构: Nanyang Technological University (南洋理工大学); National University of Singapore (新加坡国立大学); VinUniversity (越南大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 19 pages
Abstract:Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax’s geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.
[NLP-126] BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution
【速读】: 该论文旨在解决当前前沿大语言模型(Large Language Models, LLMs)在代码生成任务中因基准测试(benchmark)普遍饱和而导致的模型能力难以区分、训练信号失效的问题。随着模型性能持续提升,现有数据集(如LiveCodeBench)已无法有效区分不同模型的能力,例如在简单任务上模型通过率(Pass@1)超过99%,整体平均也超过90%,导致评估与训练价值下降。其核心解决方案是提出BenchEvolver——一种以任务演化为核心的自动生成框架,通过结构化变换已有参考解法(reference solutions),并基于演化的可执行语义自动推导出新的题目描述与测试用例,从而实现高质量、高难度、多样化且可验证正确性的新任务构建。该方法避免了从零生成问题的高成本,同时确保生成任务在语义上与原问题一致且更具挑战性。实验表明,将BenchEvolver应用于LiveCodeBench和SciCode后,生成的任务显著更难,且保持了参考解的正确性与多样性;由此构建的LiveCodeBench-Plus基准包含91个难题,使前沿模型的通过率降至27.5%至62.6%,恢复了模型间的有效区分度。更重要的是,演化后的任务对生成它们的模型仍具挑战性,支持模型自我改进;强化学习(RL)在演化任务上的训练进一步提升了模型性能,相较仅使用原始种子数据的训练,分别在LCB v6 Hard和LCB-Pro Easy上取得+8.7和+8.3的通过率增益,提升幅度分别高出70.7%和34.8%。结果表明,BenchEvolver能够将趋于饱和的基准转化为具备前沿评估与训练价值的动态体系,为大模型持续发展提供可扩展的评测与优化路径。
链接: https://arxiv.org/abs/2606.01286
作者: Yangzhen Wu,Aaron J. Li,Wenjie Ma,Li Cao,Ziheng Zhou,Mert Cemri,Shu Liu,Yuran Xiu,Chenxiao Yan,Haikun Zhao,Bin Yu,Ion Stoica,Dawn Song
机构: University of California, Berkeley; Institute for Interdisciplinary Information Sciences, Tsinghua University
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.
[NLP-127] Worlds Within Words: Translating Culture in Ancient Chinese Texts with Multi-Agent Coordination
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在翻译古汉语文献时对文化负载词(Culture-Loaded Words, CLWs)处理不足的问题,核心挑战在于如何在保留原文文化内涵的同时,合理决定何时以及以何种程度对文化相关知识进行显性化解释,避免因直译导致概念丢失或因过度解释影响文本简洁性与可读性。其解决方案的关键是将文化负载词的翻译建模为一种选择性显性化(selective explicitation)任务,并提出一种多智能体文化感知翻译框架——MACAT(Multi-Agent Culture-Aware Translation)。MACAT通过动态识别文化显著性短语,在必要时注入简洁的解释性知识,同时引入质量感知重排序模块和多轮评估智能体,从术语准确性、可读性、忠实度、文化保留度及文化显性化等多个维度综合评估翻译质量。实验结果表明,在中医经典和《论语》等古籍文本上,MACAT在统一的GPT-5.4评估环境下,显著优于基线模型与通用机器翻译方法。
链接: https://arxiv.org/abs/2606.01276
作者: Xiaoqi He,Kaixin Lan,Mu You,Tao Fang,Lidia S. Chao,Derek F. Wong
机构: 未知
类目: Computation and Language (cs.CL)
备注: The preprint manuscript is 20 pages long and is currently under review
Abstract:Large language model (LLM)-based machine translation has advanced cross-cultural communication, yet it still struggles with culture-loaded words (CLWs) in ancient Chinese texts. The challenge extends beyond lexical alignment to deciding when and how culture-dependent knowledge should be explicated for readers lacking relevant background. Literal translation often preserves surface forms while missing underlying concepts, whereas over-explicitation harms conciseness and readability. To address this problem, we formulate CLW translation as a selective explicitation task and propose \textbfMACAT, a \textbfMulti-\textbfAgent \textbfCulture-\textbfAware \textbfTranslation framework that dynamically identifies culturally salient phrases and injects concise explanatory knowledge when necessary. MACAT further incorporates a quality-aware reranking module for candidate selection and a multi-round evaluation agent that assesses translations across terminological precision, readability, fidelity, cultural preservation, and cultural explicitation. Experiments on traditional Chinese medicine (TCM) classics and the \textitAnalects show that, under a unified GPT-5.4 evaluation setting, MACAT consistently outperforms both the backbone model and general-purpose MT baselines on 100 TCM documents and a 20-chapter subset of the \textitAnalects.
[NLP-128] IndoBias: A Dual Track Culturally Grounded Benchmark for LLM s Bias Evaluation in Indonesian Languages
【速读】: 该论文旨在解决印度尼西亚这一多民族、多语言、多元文化背景下大型语言模型(Large Language Models, LLMs)中存在的代表性偏见与本土化刻板印象评估缺失的问题。由于印尼拥有超过1300个族群和700种原住民语言,而现有研究尚未充分考察本地语境下的模型偏见,导致在文化特定语境中对公平性评估存在显著空白。为此,本文提出IndoBias——一个基于印尼文化语境的偏见基准测试框架,用于评估印尼语及三种地方语言(爪哇语、巽他语、望加锡语)中的模型偏见。其解决方案的关键在于构建双视角评估体系:一是侧重深度的对比对(contrastive-pairs)评估,二是侧重广度的生成式评估,后者基于社会科学研究框架(SPI、O*NET 和 WGI),以增强评估的理论基础与现实相关性。实验结果表明,现有解码器类模型在印尼语中对典型句式表现出显著偏见,而地方语言在意识形态与宗教类别上则面临更高程度的偏见;同时,模型对不同地方实体的回应呈现出非均匀的刻板印象极性。此外,研究发现,在预训练阶段,来自Common Crawl的文本比经人工审核的文章(如维基百科、新闻)引入更多偏见,而将地方语言纳入预训练数据通常会加剧偏见。本研究强调了在文化特定语境下开展偏见研究的重要性,并为未来多语言、跨文化语境下的公平性评估提供了方法论支持。
链接: https://arxiv.org/abs/2606.01260
作者: Ikhlasul Akmal Hanif,Muhammad Falensi Azmi,Filbert Aurelian Tjiaranata,Eryawan Presma Yulianrifat,Fajri Koto
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Universitas Indonesia (印度尼西亚大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite being home to more than 1300 ethnic groups and 700 indigenous languages, bias in Large Language Models has not been fully studied in Indonesia, thus leaving a critical gap in evaluating representational fairness and localized stereotypes within its uniquely vast, multilingual, and diverse sociocultural landscape. To address this, we introduce IndoBias as a culturally-grounded bias benchmark to assess LLMs bias in Indonesian and three local languages: Javanese, Sundanese, and Makasar. IndoBias features dual perspective evaluation tracks: depth-oriented (with contrastive-pairs) and breadth-oriented (with generation-based), where the latter is grounded in social science frameworks (SPI, O*NET, and WGI). Our results show that existing LLMs – particularly decoder models – exhibit strong bias towards prototypical sentences in Indonesian, while local languages suffer higher bias under Ideology and Religion category. We also find that LLMs responses exhibit a non-uniform Stereotype Polarity when prompted with various local entities. Finally, we discover that, in Indonesian, Common Crawl texts introduce more bias during pretraining, compared to human-reviewed article texts (e.g., Wikipedia, News), whereas introducing local languages to pretraining generally increases bias. This work highlights the importance of studying bias in culture-specific context. Warning: This paper contains example data that may be offensive, harmful, or biased.
[NLP-129] Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding
【速读】: 该论文旨在解决标准位置编码(Positional Encoding, PE)在Transformer模型中对位置信息建模的局限性问题:现有方法如正弦编码和旋转位置编码(Rotary Position Embedding, RoPE)将所有位置视为同等局部,仅编码令牌的位置信息,而无法动态捕捉位置影响的传播范围。其核心解决方案是提出摩尔特位置编码(Morlet Positional Encoding, MoPE),利用具有最小位置-频率不确定性特性的摩尔特小波作为位置编码的基础,使每个嵌入维度能够从数据中自学习其独特的频率与局部带宽(locality bandwidth)。MoPE的关键在于通过可学习的参数实现对位置影响范围的动态建模,其中相位部分精确恢复了RoPE的旋转角度,振幅部分则引入了可学习的高斯局部核,弥补了传统编码的不足。理论层面,正弦编码和RoPE相关核函数被证明均为MoPE在局部性参数趋于无穷时的极限情形;实验表明,MoPE结合能量门控注意力(Energy-Gated Attention)在TinyShakespeare任务上相较标准注意力提升0.119,且优于单一组件。对学习参数的分析进一步发现,所有128个频率-带宽组合均收敛至小波可容许性边界,这一现象与能量门控的理论结果一致,暗示字符级语言信号中存在可复现的内在结构特征,值得深入研究。
链接: https://arxiv.org/abs/2606.01258
作者: Athanasios Zeris
机构: Independent Researcher, Athens, Greece
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Signal Processing (eess.SP)
备注: 16 pages, 4 figures, 4 tables
Abstract:Standard positional encodings for transformers - sinusoidal and rotary (RoPE) - treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimises uncertainty in position and frequency, is the natural basis for positional encoding, and introduce Morlet Positional Encoding (MoPE): each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: sinusoidal PE and the RoPE correlation kernel both emerge as limiting cases of MoPE when locality is switched off (sigma_i - infinity). The phase of MoPE recovers the RoPE rotation angle exactly; the amplitude adds a learned Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention achieves +0.119 improvement over standard attention on TinyShakespeare, outperforming either component alone. Analysis of the learned parameters reveals that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary - an empirical observation consistent with a companion result on energy gating, suggesting a reproducible property of character-level language signals that warrants further investigation.
[NLP-130] Agent ic Clustering: Controllable Text Taxonomies via Multi-Agent Refinement
【速读】: 该论文旨在解决现有文本聚类方法中因采用固定程序化流程而导致的泛化能力差与灵活性不足的问题。传统方法依赖预设的大型语言模型(LLM)调用序列及终止、合并、拆分聚类的规则,难以适应不同结构的语料库,且无法便捷地融入用户指定的约束条件(如目标聚类数量或聚类意图)。其解决方案的关键在于提出一种基于代理(agentic)的动态聚类框架:由一个协调器(orchestrator)LLM在每一步评估聚类发现过程的状态,并根据需要调度一组功能专一的代理(包括提议者、合成者、审计者、调查者和批评者),使聚类流程能够自适应地响应语料特性,而非执行固定的代码逻辑。在七个公开的文本聚类基准测试中,该方法实现了最先进的性能,相较于最强的先前LLM基线,在调整后的兰德指数(ARI)上最高提升达32%。
链接: https://arxiv.org/abs/2606.01255
作者: Simon Löwe,Emily Silcock
机构: Burning Glass Institute; Harvard University
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent text-clustering methods use large language models to propose a cluster taxonomy from a corpus and then assign each text to it. These pipelines are fundamentally programmatic: the sequence of LLM calls and the rules for stopping, merging, and splitting clusters are fixed in code in advance, so they generalise poorly across corpora of different structure and cannot easily incorporate user-supplied constraints such as a target cluster count or a clustering intent. We propose an agentic alternative in which an orchestrator LLM inspects the state of the discovery process at each step and dispatches one of a small set of specialised agents - proposer, synthesizer, auditor, investigator, and critic - adapting the pipeline to the corpus rather than executing a fixed one. On seven public text-clustering benchmarks the method achieves state-of-the-art performance, beating the strongest prior LLM baseline by up to 32% in ARI.
[NLP-131] Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization
【速读】: 该论文旨在解决多目标跨语言文本摘要(Multi-target cross-lingual text summarization, MTXLS)任务中缺乏系统性评估基准与模型内在机制理解不足的问题。当前MTXLS虽在多语言内容消费背景下日益重要,但研究仍处于初步阶段,尤其在性能表现上远落后于英语单语摘要。为填补这一空白,研究提出一个涵盖24种目标语言的新型基准——多目标跨语言元素感知(Multi-target cross-lingual element-aware, MEA),并系统评估端到端与流水线式大语言模型(LLM)方法的表现。关键发现表明,翻译与摘要行为并非在模型中以分离阶段实现,而是在深层网络中协同涌现,且多数任务相关处理及错误均集中于这些深层模块。基于此,研究提出一种推理时激活引导(inference-time activation steering)方法,利用英语摘要任务中的隐藏表示来指导多语言摘要生成,显著提升了跨语言摘要质量,验证了该方法在不同目标语言上的普适有效性。
链接: https://arxiv.org/abs/2606.01252
作者: Sangwon Ryu,Yihong Liu,Mingyang Wang,Yunsu Kim,Jungseul Ok,Gary Geunbae Lee,Hinrich Schuetze
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-target cross-lingual text summarization (MTXLS), which summarizes a source document into multiple target languages, is increasingly important as users consume content in diverse languages, but remains underexplored. To address this gap, we introduce multi-target cross-lingual element-aware (MEA), a new MTXLS benchmark covering 24 target languages. We benchmark end-to-end and pipeline approaches across various LLMs and show that MTXLS performance still substantially lags behind English monolingual summarization. To better understand MTXLS in LLMs, we propose a layer-wise analysis framework for investigating how LLMs internally perform MTXLS. Our analyses suggest that translation and summarization behaviors emerge jointly within later layers rather than as distinctly decomposed stages. Most task-relevant processing occurs within these layers, and errors also tend to arise at similar depths. Motivated by these findings, we introduce an inference-time activation steering method that leverages hidden representations from English summarization to guide MTXLS generation. Experiments show that our method consistently improves MTXLS quality across target languages.
[NLP-132] rust Region On-Policy Distillation
【速读】: 该论文旨在解决生成式大语言模型(Large Language Models, LLMs)在后训练阶段采用在线策略蒸馏(On-Policy Distillation, OPD)时因教师与学生分布差异显著而导致的训练不稳定性问题。具体而言,当教师对学生产生的令牌(tokens)提供监督信号时,若分布不匹配,会导致不可靠的策略梯度估计,进而引发优化失败。为应对这一挑战,论文提出了一种基于可信区域的在线策略蒸馏方法(Trust Region On-Policy Distillation, TrOPD),其核心解决方案在于:通过可信区域机制仅在教师提供可靠监督的区域内执行OPD,从而缓解因分布偏移导致的逆KL散度(reverse-KL)估计困难;同时引入异常值检测策略,结合梯度裁剪、掩码处理和前向KL估计以抑制不可靠监督的影响;此外,通过利用教师前缀进行离线策略引导(off-policy guidance),并采用前向KL实现模仿学习,促进学生在探索过程中向高置信度区域收敛。实验表明,TrOPD在数学推理、代码生成及通用领域基准测试中均显著优于当前主流的OPD基线方法(如OPD、EOPD、REOPOLD)。
链接: https://arxiv.org/abs/2606.01249
作者: Xingrun Xing,Haoqing Wang,Boyan Gao,Ziheng Li,Yehui Tang
机构: Samsung Research(三星研究院); University of Oxford(牛津大学); Peking University(北京大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
[NLP-133] Unlocking the Black Box of Latent Reasoning : An Interpretability-Guided Approach to Intervention
【速读】: 该论文旨在解决生成式AI(Generative AI)中大语言模型(Large Language Models, LLMs)在执行多步推理时,尽管通过隐式推理(latent reasoning)提升了效率,但其内部连续隐藏状态(continuous hidden states)缺乏可解释性与可控性的问题。核心挑战在于,虽然隐式推理避免了显式思维链(Chain-of-Thought, CoT)的冗余计算,但其内部表示的“黑箱”特性限制了对推理过程的理解与干预能力。论文的关键解决方案是基于结构、因果与几何探测的系统性分析,揭示了隐向量中编码了压缩且忠实的推理步骤信息,且早期隐向量作为关键因果枢纽起主导作用。在此基础上,提出一系列无需训练、仅在解码阶段实施的干预方法,通过施加所发现的几何与语义先验,对隐式推理过程进行精细化调控。实验表明,该方法在多种模型规模和任务领域下均能显著提升推理准确性,验证了可解释性引导的干预策略可在不更新参数的前提下有效释放模型潜在推理能力。
链接: https://arxiv.org/abs/2606.01243
作者: Shuochen Chang,Tong Bai,Xiaofeng Zhang,Qianli Ma,Qingyang Liu,Zhaohe Liao,Yibo Miao,Li Niu
机构: Shanghai Jiao Tong University (上海交通大学); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our approaches consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.
[NLP-134] Efficient RAG with Intent-Aware Retrieval and Semantics-Preserving Chunking
【速读】: 该论文旨在解决传统检索增强生成(Retrieval-Augmented Generation, RAG)系统在实际应用中面临的两大核心问题:意图无关的检索(intent-agnostic retrieval)与信息碎片化(information fragmentation)。前者导致检索结果与用户查询意图匹配度不足,后者则因文本分块方式破坏了关键证据的语义完整性,进而影响生成质量。为应对上述挑战,本文提出一种名为InSemRAG的新型RAG框架,其核心创新在于采用迭代式“检索-验证”机制,并集成两个关键组件:意图感知检索器(Intention-Aware Retriever, IAR)与语义保全分块(Semantics-Preserving Chunking, SPC)。IAR通过动态融合多种检索通道并依据查询意图自适应调整权重,实现更精准的上下文相关检索;SPC则负责识别并修复受损的证据片段,以维持原始语义连贯性。此外,为缓解迭代机制带来的计算延迟,系统引入小语言模型(Small Language Models, SLMs)进行轻量化推理,显著降低延迟。实验结果表明,InSemRAG在多跳问答与依赖证据的任务上表现突出,在HotPotQA上F1提升2.65点,在FEVER上准确率提升1.5点,且相较主流多跳RAG方法在保持竞争力的同时,实现了4.32倍的延迟降低。
链接: https://arxiv.org/abs/2606.01240
作者: Fachrina Dewi Puspitasari,Chaoning Zhang,Jiaquan Zhang,Zhicheng Wang,Hafiz Shakeel Ahmad Awan,Rizwan Qureshi,Jewon Lee,Tae-Ho Kim,Yang Yang
机构: University of Electronic Science and Technology of China; Massachusetts General Hospital, Harvard University; Nota AI
类目: Computation and Language (cs.CL)
备注:
Abstract:The demand for powerful instruction following and reasoning capability of large language models (LLMs) has promoted rapid development of retrieval-augmented generation (RAG). The RAG system assists LLM generation by retrieving chunks of query-fit supplementary knowledge from an external database. Conventional RAG systems, however, suffer from information insufficiency due to two factors, which are intent-agnostic retrieval and information fragmentation. Our work proposes a RAG framework, termed InSemRAG, that addresses these challenges via an iterative retrieve-and-check mechanism with two supporting modules, an intention-aware retriever (IAR) and semantics-preserving chunking (SPC). IAR implements a dynamic hybrid retrieval method that adaptively weights the retrieval channels based on the query intent, while SPC performs detection and reparation to the damaged evidence chunks to preserve the semantic integrity. To alleviate the computational latency brought by our iterative mechanism, we leverage small language models (SLMs). Extensive experiments across several benchmark datasets consistently demonstrate the competitiveness of our method against recent state-of-the-art RAG mechanisms. Particularly, our method achieves significant gains on multi-hop and evidence-sensitive tasks, with a 2.65-point improvement in F1 on HotPotQA and a 1.5-point increase in accuracy on FEVER. Our method also achieves competitive performance to Multi-Hop RAG with 4.32 \times lower latency with the utilization of SLM.
[NLP-135] Connecting the Dots: Benchmarking Reflective Memory in Long-Horizon Dialogue
【速读】: 该论文旨在解决现有长上下文建模基准在评估模型反射性记忆(reflective memory)能力方面的不足,即当前基准多局限于显式事实记忆的直接召回,无法衡量模型将分散的、多模态线索整合为高层级理解所需的深层推理能力。其核心解决方案是提出一个名为RefMem-Bench的基准,包含26,000个标注的问答实例,涵盖八种反射性记忆维度和三种任务形式,要求模型超越表层信息检索,从交互历史中分布式的证据中推断潜在语义。为增强模型的反射性记忆能力,论文进一步提出一种分层框架——反射性记忆诱导(REMIND),将反射性记忆视为渐进式意义建构过程,通过问题感知的证据检索、显著性感知的定位以及抽象层级监督,并结合渐进式反射对齐(Progressive Reflective Alignment)机制,将高层级的反思推理能力蒸馏至事实推理路径中。实验表明,RefMem-Bench对现有模型构成显著挑战,而REMIND通过逐步实现证据感知、定位与抽象,持续提升答案准确率与记忆召回性能。
链接: https://arxiv.org/abs/2606.01223
作者: Jingjie Lin,Bingbing Wang,Zihan Wang,Zhengda Jin,Weiming Qiao,Jing Li,Ruifeng Xu
机构: Harbin Institute of Technology, Shenzhen; The Hong Kong Polytechnic University; Fudan University; Peng Cheng Laboratory
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
Abstract:Despite substantial progress in long-context modeling, existing benchmarks remain confined to factual memory for explicit recall, failing to measure the reflective memory required to synthesize fragmented, multimodal cues into high-level interpretations. To address this gap, we introduce RefMem-Bench, a benchmark for reflective memory in long-horizon dialogue. RefMem-Bench contains 26K annotated QA instances with eight reflective-memory dimensions and three task formats, requiring models to move beyond surface-level retrieval and infer latent meanings from evidence distributed across interaction histories. To enhance reflective memory capability, we propose REflective Memory INDuction (REMIND), a hierarchical framework that treats reflective memory as progressive meaning construction. REMIND couples question-conditioned evidence retrieval, salience-aware grounding, and abstraction-level supervision, and uses Progressive Reflective Alignment to distill high-level reflective reasoning into the factual inference pathway. Experiments show RefMem-Bench poses a substantial challenge to current models, while REMIND consistently improves both answer accuracy and memory recall through progressive evidence perception, grounding, and abstraction.
[NLP-136] Distilling Neuro-Symbolic Programs into 3D Multi-modal LLM s ICML2026
【速读】: 该论文旨在解决当前3D空间推理方法中存在的根本性权衡问题:神经符号3D(Neuro-Symbolic 3D, NS3D)方法虽能通过组合式程序实现可解释的推理,但受限于封闭词汇表和简单程序;而端到端的3D多模态大语言模型(3D Multi-Modal Large Language Models, 3D MLLMs)虽可处理复杂自然语言与开放词汇概念,却存在黑箱推理且缺乏显式的空间验证机制。为此,本文提出APEIRIA,一种融合神经符号与3D多模态大语言模型的新型框架,其核心在于通过自然语言思维链(Chain-of-Thought, CoT)将符号推理模式从NS3D方法中提炼并注入3D MLLMs中。该方案采用三阶段课程学习策略:(a) 3D感知对齐将物体的视觉-几何特征与大语言模型(LLM)进行对齐;(b) CoT-SFT(Chain-of-Thought Supervised Fine-Tuning)利用符号程序轨迹教导查询分解与逐步验证能力;© CoT-RL(Chain-of-Thought Reinforcement Learning)进一步拓展推理模式至开放词汇概念与深层嵌套指令。通过传递推理模式而非特定概念知识,APEIRIA保留了NS3D方法的核心优势——透明化推理过程以及规划与感知模块的模块化可替换性。在定位、问答与描述生成任务上的实验表明,APEIRIA不仅超越了以往的NS3D方法,且在主流3D空间推理数据集上达到顶尖3D MLLMs的性能水平,实现了符号方法的系统性推理能力与大语言模型灵活性的统一。
链接: https://arxiv.org/abs/2606.01215
作者: Wentao Mo,Yang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: To appear in ICML 2026
Abstract:Current 3D spatial reasoning methods face a fundamental trade-off: neuro-symbolic 3D (NS3D) concept learners achieve interpretable reasoning through compositional programs but are constrained to closed-set concept vocabularies and simple programs; end-to-end 3D multi-modal LLMs (3D MLLMs) could handle complex natural language and open-vocabulary concepts but suffer from black-box reasoning without explicit spatial verification. We introduce APEIRIA, a neuro-symbolic 3D MLLM to bridge two paradigms by distilling symbolic reasoning patterns into MLLMs with natural language chain-of-thought. Our three-stage curriculum progressively builds reasoning capabilities: a) 3D perception alignment grounds object visual-geometric features to the LLM, b) CoT-SFT teaches query decomposition and stepwise verification from symbolic program traces, and c) CoT-RL extends reasoning patterns to open-set concepts and deeply nested instructions. By transferring reasoning patterns rather than concept-specific knowledge, APEIRIA preserves key NS3D virtues: transparent reasoning and modular interchangeability of planning and perception components. Evaluations on grounding, question answering, and captioning show that APEIRIA surpasses prior NS3D methods and matches state-of-the-art 3D MLLMs on 3D spatial reasoning datasets, unifying symbolic methods’ systematic reasoning with MLLMs’ flexibility. Code is available at this https URL.
[NLP-137] ECCI: Tricky Edits of Collected and Curated Images
【速读】: 该论文旨在解决当前文本引导图像编辑方法在指令遵循、对源图像的最小修改以及保证高视觉质量方面存在的系统性挑战,尤其针对位置、运动、视角、尺度及创造性编辑等复杂编辑任务表现不佳的问题。其解决方案的关键在于提出一个全新的图像编辑评估基准——TECCI(Tricky Edits of Collected and Curated Images),该基准包含7个类别共7550组图像与编辑指令对,通过自动化的Gemini生成5类编辑指令,并结合人工精心设计的530张具有高难度编辑指令的图像,全面覆盖现有方法的薄弱环节。为实现高效评估,研究还构建了一个基于Gemini的自动评分模型,实现了与人类评价74.7%的一致性。实验结果表明,当前主流模型在整体成功率上均未超过22%,且在空间布局理解、细节保持和创造性编辑方面存在显著不足,揭示了现有生成式图像编辑技术在复杂语义理解与精细控制方面的局限性。
链接: https://arxiv.org/abs/2606.01213
作者: Aishwarya Agrawal,Roy Hirsch,Yasumasa Onoe,Sherry Ben,Jason Baldridge
机构: Google Research (谷歌研究); Google DeepMind (谷歌深度思维)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Despite tremendous recent progress, current text-guided image editing methods still struggle with many aspects of editing involving instruction following, minimally editing the source image, and ensuring high visual quality. These problems are especially apparent when the requested edit is challenging, such as those that involve position, motion, viewpoint, scale and creative edits. To systematically test generative image editors, we propose a novel image editing benchmark – TECCI: Tricky Edits of Collected and Curated Images. TECCI consists of a completely new set of images we are releasing. The images in TECCI span 7 image categories. The images and these categories were curated intentionally to target weaknesses of existing methods. The edit instructions in TECCI are automatically generated by Gemini, covering 5 edit types per source image. We also curated a set of 530 images for which we created challenging manually written edit instructions. Overall, TECCI contains 7550 pairs of images and edit instructions. We conduct human evaluations of five leading image editing models on TECCI. Humans judge outputs along three dimensions: 1) instruction following, 2) minimality of the edits, and 3) visual quality. To scale-up the evaluation, we also build an auto-rater using Gemini that achieves 74.7% accuracy in matching human evaluations. Our evaluations reveal that: 1) none of the models exceed a 22% overall success rate, demonstrating the challenging nature of TECCI, 2) Nano Banana Pro is the best performing model overall, 3) models perform significantly better at instruction following compared to minimal edits and visual quality, 4) models struggle with editing architecture and nature images which require strong understanding of spatial layout and intricate visual details. 5) reasoning and creative edits are the most difficult, whereas color and appearance edits are the easiest.
[NLP-138] Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在医疗分诊场景中是否存在因患者提示语语言不同而引发推荐差异的问题,尤其关注语言因素是否导致模型对相同症状产生不一致的紧急程度判断。研究发现,尽管所有语言版本对症状严重性评分高度一致(7.7–8.0/10),但推荐急诊就诊的比例却在0%(日语、印地语)至30%(英语、阿拉伯语)之间显著波动。其关键解决方案在于揭示了模型存在基于输入语言隐含推断地理位置的偏差机制:在非英语提示中加入“美国”地理位置信息可使急诊推荐率提升最高达76.7个百分点,而将英语提示与“东京”位置结合则使推荐率从30%降至6.7%。通过回译控制实验(日语→英语)验证,该差异并非由翻译质量引起,而是源于模型对语言与地理之间的隐含关联进行错误推断,表明当前大语言模型在跨语言医疗决策中存在严重的文化-地理偏见风险。
链接: https://arxiv.org/abs/2606.01204
作者: Qi Han Wong
机构: Google(谷歌)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 7 pages, 4 tables. Code and data at this https URL
Abstract:We investigate whether large language models produce different medical triage recommendations for identical symptoms based solely on the language of the patient prompt. Using Gemini 3.5 Flash, we evaluate a neurological symptom profile (persistent headache, blurred vision, nausea) across six languages (English, Spanish, Chinese, Hindi, Japanese, Arabic) with 30 runs per condition (n=450 total API calls). We find that the model recommends emergency room visits at rates ranging from 0% (Japanese, Hindi) to 30% (English, Arabic), despite assigning nearly identical severity scores (7.7-8.0/10) across all languages. Adding a single sentence specifying the patient’s US location increases ER recommendations by up to 76.7 percentage points for non-English prompts, while the reverse anchor (English prompt with a Tokyo location) reduces the ER rate from 30% to 6.7%. A back-translation control (Japanese to English) produces ER rates comparable to the English baseline, confirming that the disparity is not caused by translation quality but by implicit geographic inference from the input language. We release the complete dataset, experiment code, and results.
[NLP-139] he Shape of Wisdom: Decision Trajectories in Language Models
【速读】: 该论文旨在解决大语言模型在生成答案过程中,其输出决策机制并非简单地在最终层进行“选择”,而是随着推理深度的推进呈现出结构化的动态演变问题。核心挑战在于理解模型内部状态(尤其是答案置信度的边际变化)如何随层深演化,并识别哪些因素真正驱动正确答案的稳定形成。解决方案的关键在于提出一种可复现的分析框架,通过三个关键量对每条推理轨迹进行刻画:当前答案的置信度边际(current answer margin)、下一层该边际的变化量(next-layer change in that margin),以及距离决策翻转点的距离(distance from a decision flip)。研究发现,正确性与稳定性高度分离——多数答案属于“不稳定但正确”而非“稳定且正确”。进一步分析表明,在稳定正确的案例中,注意力机制的平均标量方向倾向于支持正确答案,而前馈网络(MLP)的平均标量则无此倾向;通过文本片段删除实验发现,移除支持性内容会削弱置信度边际,而移除干扰项类似内容反而有助于提升边际。该方法虽未提供完整的因果解释回路,但为识别哪些答案已确定、哪些仍脆弱,以及哪些具体成分(如特定文本或模块响应)推动了边际变化,提供了可重复、可量化的分析路径。
链接: https://arxiv.org/abs/2606.01202
作者: Shailesh Rana
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 6 pages, 5 figures. Code and derived artifacts: this https URL
Abstract:Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.
[NLP-140] Low-Resource Safety Failures Are Action Failures Not Representation Failures
【速读】: 该论文旨在解决高资源语言中学习到的安全对齐(safety alignment)在低资源语言中迁移效果差的问题,即模型在英语中能有效拒绝有害提示,但在将其翻译为斯瓦希里语或缅甸语等低资源语言后却无法实现相同级别的拒绝行为。尽管跨语言的有害性表征(harmfulness direction)在低资源语言中依然存在且可被线性分离,但模型在决策层面缺乏对安全性的校准(calibration),导致其无法将正确的表征转化为有效的拒绝动作。解决方案的关键在于不进行重新训练,而是通过重校准(recalibrating)已有的高资源语言安全门控机制——具体采用一个低秩逻辑读出层,并仅需每类1至4个目标语言样本即可重新设定决策阈值,从而显著提升低资源语言下的拒绝选择性(从33.6提升至54.5),同时保持多任务语言理解基准(MMLU)性能。研究结果表明,部分低资源语言中的安全失效问题可通过现有表征的校准而非重新学习新表征来修复。
链接: https://arxiv.org/abs/2606.01196
作者: Rashad Aziz,Ikhlasul Akmal Hanif,Fajri Koto
机构: Mohamed bin Zayed University of Artificial Intelligence
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Safety alignment learned in high-resource languages transfers poorly to low-resource languages. Models refuse harmful prompts in English but fail to refuse when the same prompts are translated into Swahili or Burmese. Adaptive steering methods like AdaSteer and CAST inherit this failure cross-lingually. We diagnose where transfer breaks down. Across Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B on 23 languages, the harmfulness direction extracted from high-resource activations linearly separates harmful from harmless low-resource prompts nearly as well as high-resource ones. The relevant representation is present. Yet harmful refusal drops from 87.9% to 43.9%. The model fails to convert the representation into refusal. What fails to transfer is calibration of the safety decision, not the underlying representation. We exploit this by recalibrating, rather than retraining, a high-resource gate: a low-rank logistic readout with its decision threshold reset using as few as 1 to 4 target-language examples per class. The gate routes between refusal steering and harmfulness-direction ablation, substantially raising mean refusal selectivity ( \Delta = harmful - harmless refusal) from 33.6 for the strongest adapted baseline to 54.5 while preserving MMLU utility. These results suggest that some low-resource safety failures can be repaired by recalibrating existing representations rather than learning new ones. Our code is released: this https URL.
[NLP-141] CA-BED: Conversation-Aware Bayesian Experimental Design ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在交互式场景下因需通过主动提问获取信息而导致性能下降的问题。核心挑战在于如何选择能够有效降低不确定性、同时处理模糊或部分信息性回答的最优问题。其解决方案的关键是提出一种基于贝叶斯实验设计的对话感知推理框架——对话感知贝叶斯实验设计(Conversation-Aware Bayesian Experimental Design, CA-BED),该方法在推理阶段融合贝叶斯实验设计与基于LLM的似然估计,通过维护假设的信念分布、预判可能的回答,并在模拟对话树中传播期望信息增益,实现多轮对话中问题选择的优化。实验结果表明,CA-BED在两个结构化实体推断基准上相比直接提示(direct prompting)平均提升21.8%的成功率,且仅增加约1.8轮对话,显著优于其他信息获取方法。
链接: https://arxiv.org/abs/2606.01182
作者: Daniel Arnould,Rashad Aziz,Zixuan Kang,Tanav Changal,Kevin Zhu,Sunishchal Dev,Gabriel Grand,Shreyas Sunil Kulkarni
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Reliable Autonomy Workshop at ICLR 2026
Abstract:Large Language Models (LLMs) excel at static reasoning tasks, yet their performance often degrades in interactive scenarios where information must be actively acquired through questioning. A key challenge lies in selecting questions that reduce uncertainty while incorporating responses that may be ambiguous or only partially informative. To address this, we propose Conversation-Aware Bayesian Experimental Design (CA-BED), an inference-time probabilistic dialog planning framework that integrates Bayesian Experimental Design with LLM-based likelihood estimation to optimize question selection over multiple conversational turns. CA-BED maintains a belief distribution over hypotheses, anticipates possible answers, and propagates expected information gain through a simulated conversation tree. Across two structured entity-deduction benchmarks, CA-BED yields an average 21.8% improvement in success rates over direct prompting, with comparable gains relative to alternative information-seeking methods. It achieves these gains with an average increase of only 1.8 conversational turns compared to direct prompting.
[NLP-142] hinking Economically: A Hierarchical Framework for Adaptive-Complexity Reasoning in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)在使用思维链(Chain-of-Thought, CoT)推理时因“过度思考”(overthinking)导致的计算开销过大的问题,即生成过长的推理过程却未能带来相应的准确率提升。现有高效方法多采用统一压缩策略,忽视了推理复杂性在不同问题之间以及单个推理步骤内部存在的异质性。为此,论文提出“思维经济性”(Thinking Economically)原则,强调根据任务本身及推理步骤的内在需求智能分配计算资源,而非追求形式上的简洁。其核心解决方案是层级自适应预算框架(Hierarchical Adaptive Budgeter, HAB),通过粗粒度到细粒度的双重预算机制实现:在跨步骤层面,预测每个问题的最优推理深度;在步骤内层面,基于概率语言模型(PPL)导出的步骤间比较学习步级粒度的令牌预算信号,并引入自适应帕累托优化目标以捕捉局部质量-效率权衡关系;同时,结合费舍尔信息(Fisher Information)基的剪枝器,在训练阶段提供细粒度指导,促使生成器内化更经济的推理模式。实验结果表明,HAB在GSM8K和MATH500数据集上不仅优于标准CoT的准确性,还显著降低令牌消耗,实现了比现有基线更强的性能-效率权衡。
链接: https://arxiv.org/abs/2606.01168
作者: Yubo Gao,Haotian Wu,Hong Chen,Junquan Huang,Yibo Yan,Jungang Li,Zihao Dongfang,Sicheng Tao,Puay Siew Tan,Jie Zhang,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Nanyang Technological University; Singapore Institute of Manufacturing Technology, A*STAR
类目: Computation and Language (cs.CL)
备注: 11 pages, 4 figures, 3 tables
Abstract:Chain-of-Thought (CoT) has significantly enhanced LLM reasoning, yet often incurs substantial computational overhead due to “overthinking”: generating excessively long rationales without commensurate accuracy gains. Existing efficiency methods typically apply uniform compression, which overlooks a critical observation that reasoning complexity is heterogeneous at two distinct granularity: across different problems and within individual reasoning steps. This motivates our principle of Thinking Economically: intelligently allocating computational resources based on intrinsic task and step demands rather than pursuing uniform brevity. We propose Hierarchical Adaptive Budgeter (HAB), a training framework that operationalizes this principle through coarse-to-fine budgeting. At the inter-step level, HAB predicts the optimal reasoning depth for each problem. At the intra-step level, HAB learns step-specific token budgeting signals from PPL-derived step comparisons and an adaptive Pareto optimization objective that captures the local quality-efficiency trade-off, while a Fisher Information-based pruner further provides fine-grained training-time guidance, thereby encouraging the generator to internalize more economical reasoning patterns. Experiments on GSM8K and MATH500 show that HAB not only surpasses standard CoT in accuracy but also reduces token usage, achieving a stronger performance-efficiency trade-off than the compared baselines.
[NLP-143] BraveGuard: From Open-World Threats to Safer Computer-Use Agents
【速读】: 该论文旨在解决生成式AI在计算机使用代理(computer-use agents)场景中因多步执行轨迹导致的安全风险难以通过孤立提示或最终响应检测的问题。传统安全防护机制依赖于静态的、基于基准测试的合成数据,无法有效捕捉真实环境中复杂且动态演化的攻击模式。其解决方案的关键在于提出一个自演化防御框架——BraveGuard,该框架通过从开放世界威胁信号和真实代理行为轨迹中持续挖掘新兴风险与攻击模式,将其转化为可执行的任务并收集代理的完整执行轨迹,从而构建轨迹级别的监督信号用于训练安全守卫模型(guard models)。该框架支持循环迭代,能够随新威胁和验证失败自动更新,形成动态适应性防御闭环。实验表明,基于BraveGuard训练的多种守卫模型(如Qwen3-Guard和Llama-Guard变体)在AgentHazard等轨迹级安全基准上显著提升检测性能,平均准确率从38.79%提升至82.38%,证明了基于真实代理行为与开放世界威胁发现的监督信号能有效超越固定分类体系和合成提示数据的局限,为应对不断演进的真实世界风险提供了可扩展的自适应安全防护路径。
链接: https://arxiv.org/abs/2606.01166
作者: Yunhao Feng,Yifan Ding,Xiaohu Du,Ming Wen,Xinhao Deng,Yanming Guo,Yuxiang Xie,Baihui Zheng,Yingshui Tan,Yige Li,Yutao Wu,Yixu Wang,Kerui Cao,Wenke Huang,Xingjun Ma,Yu-Gang Jiang
机构: Fudan University (复旦大学); National University of Defense Technology (国防科技大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign. We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories. BraveGuard mines recent research sources to identify emerging risks and attack patterns, instantiates them as executable computer-use tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. As new threats and validation failures appear, the pipeline can be repeated, yielding an adaptive defense loop rather than a static, benchmark-driven training process. We instantiate BraveGuard by training multiple guard backbones, including Qwen3-Guard and Llama-Guard variants, and evaluate the resulting guards on trajectory-level agent-safety benchmarks. BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting. These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data. BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.
[NLP-144] Not All Explanations Simulate Equally: Comparing Verbalized Feature Attributions and Self-Generated Rationales
【速读】: 该论文旨在解决自然语言解释(Natural-language explanations)作为模型行为理解统一接口时,不同解释来源在支持反事实模拟(counterfactual simulation)能力上的差异问题。尽管解释常被视为统一的可解释性工具,但其来源(如特征归因的口语化表达与自生成推理链)对模型行为的可模拟性具有显著影响。研究的关键在于通过一个共享的反事实模拟场景,利用大语言模型(LLM)裁判作为预测器,评估不同解释类型在预测模型对后续问题回答时的表现。核心发现表明,解释格式与特征粒度是影响可模拟性的关键因素:基于归因的解释与自生成推理链在提升反事实预测准确性方面表现各异,且这种差异在不同模型和解释形式间呈现显著异质性。
链接: https://arxiv.org/abs/2606.01148
作者: Pingjun Hong,Benjamin Roth
机构: University of Vienna(维也纳大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Natural-language explanations are often treated as a unified interface for understanding model behavior, but different explanation sources may support simulation in different ways. This paper compares two families of explanations for question answering models: verbalized feature attributions and self-generated rationales. We evaluate them under a shared counterfactual simulation setting, using an LLM judge as predictor and measuring whether it can better predict a model’s answers to follow-up questions when given its explanation. Across multiple instruction-tuned models, we analyze how explanation source, verbalization strategy, and feature granularity affect the simulatability of explanations. Our results show that explanation format and granularity affect simulatability: attribution-based explanations and self-generated rationales differ in how much they improve counterfactual prediction, with effects that vary across models and formats.
[NLP-145] From Outliers to Errors: Auditing Pali-to-English LLM Translations with Multi-Reference Adjudication
【速读】: 该论文旨在解决单一分数翻译评估指标在古典语言翻译中将合法的语言变体误判为错误的问题,尤其针对巴利语等古典语言中同一段落存在多种合理英文译法的特性。其解决方案的关键在于构建一个“局部参考包络”(local reference envelope),即基于三位权威译者(Bhikkhu Sujato、Thanissaro Bhikkhu、Bhikkhu Bodhi)的人工翻译作为多维度参照基准,而非依赖单一“黄金标准”。通过计算候选译文与参考包络中心的归一化嵌入漂移(normalized embedding drift)作为初步筛选信号,仅将高漂移样本(>1.5)纳入后续人工审定流程,避免对所有输出进行全量标注。进一步地,采用由三模型组成的盲评判别小组(blinded three-model LLM judge panel)对高漂移样本进行校准审定,并基于300个经作者确认的验证实例进行模型校准。研究发现,嵌入漂移主要反映翻译严重性而非绝对错误,且不同大语言模型(LLM)在高漂移尾部表现差异显著:GPT-5.5在高漂移样本中的重大错误率最低,而Grok 4.3虽产生最多异常值且尾部重大错误率最高(>3.0时达74.4%),其典型错误类型如遗漏或截断、教义术语误译等,极易误导读者。因此,该研究贡献了一个可复用的古典语向现代语翻译审计框架:以多译者构成局部参考包络,利用嵌入漂移实现高效优先级排序,仅对高风险尾部样本进行精细化审定,从而区分合法变体与实质性错误。
链接: https://arxiv.org/abs/2606.01136
作者: Máté Metzger,Nadnapang Phophichit,Hansa Dhammahaso
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint. This manuscript has not yet been peer reviewed
Abstract:Single-score translation metrics can conflate legitimate variation with error, a problem especially acute for classical languages where multiple defensible English renderings of the same passage coexist. We audit Pali-to-English output from four flagship large language models (LLMs): GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, and Grok 4.3, on 1,700 passages from the Pali Canon, using three established human translations by Bhikkhu Sujato, Thanissaro Bhikkhu, and Bhikkhu Bodhi as a local reference envelope rather than a single gold standard. Each candidate’s normalized embedding drift from the reference centroid serves as a triage signal, not an error label; the 1,203 candidates above a 1.5 drift threshold are then adjudicated by a blinded three-model LLM judge panel, calibrated against a 300-instance author-adjudicated validation set. Two results stand out. First, drift predicts severity rather than error per se: the major-error rate among adjudicated high-drift candidates rose monotonically from 7.9% in the 1.5-2.0 band to 51.6% above 3.0, while approximately 80% of 1.5-2.0 outliers were judged valid translation variations. Second, model differences were clearest in the high-drift tail: GPT-5.5 had the lowest adjudicated high-drift major-error rate, with confidence intervals overlapping those of Claude Sonnet 4.6 and Gemini 3.1 Pro; Grok 4.3 had both the largest outlier volume and the highest tail major-error rate (27.6% overall, 74.4% above drift 3.0). The dominant major-error categories (e.g. omission or truncation, doctrinal term errors) are precisely the failures most likely to mislead readers of doctrinal text. The contribution is a reusable audit design for classical-to-modern translation: define a local reference envelope from multiple human translators, use embedding drift to prioritize review, and adjudicate the flagged tail rather than treating outlier status as error.
[NLP-146] Digging Up Citations: FOSSIL a Dataset and Workflow for Reference Extraction in Law and the Humanities
【速读】: 该论文旨在解决法律与人文学科中参考文献提取难题,其核心挑战在于这些领域的学术文献主要采用脚注形式引用,而脚注中的书目信息常与评论、交叉引用交织在一起,且在语言和格式上存在显著多样性。现有针对自然科学领域末尾结构化参考文献的提取工具(如Grobid)难以有效处理此类复杂场景。为此,研究提出FOSSIL(Footnote-based Open-access SSH Scientific Instance Labels)——一个开放许可的多语言数据集,包含96篇经标注的学术文章及超过7,600个嵌入脚注的参考文献条目,并配套开发了PDF-TEI Editor协作标注工具、七名标注员的标准工作流程以及专用于脚注引用的Grobid扩展模块。实验结果表明,该专用处理流水线在端到端评估中将提取性能几乎提升一倍(微平均F1从0.36提升至0.72),主要得益于召回率的显著改善;然而,对于交叉引用和混合内容脚注仍存在较大优化空间。当前工作仍在持续进行中,包括对引用分割、解析及交叉引用消解的进一步标注与建模。
链接: https://arxiv.org/abs/2606.01109
作者: Luca Foppiano,Christian Boulanger
机构: ScienciaLAB(葡萄牙); Max Planck Institute for Legal History and Legal Theory(德国)
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL)
备注: This is an extended abstract, peer-reviewed and presented at CiteX2026 this https URL
Abstract:Citation extraction tools are designed for the structured end-of-document bibliographies of the natural sciences, but law and humanities scholarship cites references primarily in footnotes, where bibliographic data is interleaved with commentary and cross-references and varies widely across languages and styles. To address the scarcity of suitable gold-standard resources, we present FOSSIL (Footnote-based Open-access SSH Scientific Instance Labels), an openly licensed multilingual dataset of 96 annotated scholarly articles containing over 7,600 footnote-embedded references, together with PDF-TEI Editor (a collaborative web annotation tool), a documented seven-annotator workflow, and a Grobid specialization for footnote-based citations. In end-to-end evaluation, the specialized pipeline nearly doubles extraction quality over default Grobid (micro-F1 from 0.36 to 0.72), driven largely by improved recall, while showing that substantial headroom remains for cross-references and mixed-content footnotes. This extended abstract presents work in progress; annotations of citations segmentation and parsing, and cross-reference resolution are ongoing.
[NLP-147] MiCU: End-to-End Smart Home Command Understanding with Large Language Model
【速读】: 该论文旨在解决智能家居生态系统中命令理解系统在处理模糊或语义不对齐指令(如“让卧室变得舒适”)时表现不佳的问题。尽管大语言模型(LLM)在跨领域泛化能力上优于传统规则系统,但其实际应用受限于领域特定数据稀缺、任务适配不足以及高计算开销。为此,论文提出一种基于用户日志与大语言模型的自动化训练数据合成流程,并构建了专用于命令理解的领域特定大模型MiCU。其核心解决方案包括:采用课程学习(curriculum learning)将领域知识注入基础模型,通过冷启动训练结合强化学习(RL)并以领域特定思维规则为指导,显著提升模型推理能力;同时引入令牌压缩技术,将设备描述压缩为单一特殊标记,大幅降低推理开销,从而实现针对长输入优化的高效变体Model-Fast。实验表明,MiCU在所有设备类别上的平均准确率提升达20.01%。在小米家庭应用中的实际部署结果显示,用户修正率降低1.57%,人工审核准确率提升32.05%,验证了其在真实场景下的优越性能。
链接: https://arxiv.org/abs/2606.01099
作者: Haowei Han,Kexin Hu,Weiwei Cai,Debiao Zhang,Bin Qin,Yuxiang Wang,Jiawei Jiang,Xiao Yan,Bo Du
机构: Wuhan University (武汉大学); Xiaomi Corporation (小米公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., “turn on the bedroom light”), they struggle with ambiguous or misaligned commands (e.g., “make the bedroom cozy”). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling \model-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at this https URL
[NLP-148] Deep Research as Rubric for Reinforcement Learning
【速读】: 该论文旨在解决开放性推理与长文本生成任务中缺乏可靠自动验证信号的问题,尤其是在基于奖励的策略优化(reward-based policy optimization)过程中,传统评分标准(rubric)往往作为静态、预设的评估模板,存在忽视任务特异性与知识密集型维度的缺陷,导致奖励信号失真。其解决方案的关键在于提出一种名为“深度研究即评分标准”(Deep Research as Rubric, DR-rubric)的两阶段框架:第一阶段通过多轮代理式搜索(multi-turn agentic search)迭代挖掘领域事实、结构约束及失败模式;第二阶段将所获证据提炼为原子化、可独立验证的约束条件,以支持基于广义奖励策略优化(GRPO)的模型训练。该方法的核心创新在于将评分标准构建本身视为一项研究过程,利用待训练模型自身作为评分标准生成器,实现无需前沿模型辅助的自举式(bootstrap)评分标准生成。实验在6个涵盖代理式研究与专家推理的任务上验证了DR-rubric的有效性,结果显示仅需1K–3K训练样本即可达到强竞争力表现,其中GPT-5生成的评分标准在代理任务上覆盖更广,Gemini生成的评分标准在两类任务间平衡性最佳,而自举生成的评分标准呈现从专业化到再平衡的演化趋势,在第三轮迭代时取得最优综合性能。结果表明,将评分标准构建从静态评估模板重构为基于证据的研究流程,能够为开放性任务提供更具可扩展性与细粒度的奖励信号。
链接: https://arxiv.org/abs/2606.01091
作者: Wangyi Mei,Zhouhong Gu,Zhenhan Bai,Yin Cai,Lefan Zhang,Zhenxin Ding,Bo Chen,Yan Gao,Yi Wu,Yao Hu,Jiaqing Liang,Deqing Yang
机构: Fudan University (复旦大学); Xiaohongshu Inc. (小红书); Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Open-ended reasoning and long-form generation tasks lack reliable automatic verification signals for reward-based policy optimization. Rubrics offer a promising alternative, but existing approaches treat them as given artifacts – either hand-crafted or prompt-generated – and often miss the task-specific, knowledge-intensive dimensions that matter most, distorting the reward signal. Our key observation is that rubric construction is itself a research problem: identifying what makes a response correct or insightful requires discovering and synthesizing external knowledge. We propose Deep Research as Rubric (DR-rubric), a two-stage framework for constructing such rubrics. Stage I elicits domain facts, structural constraints, and failure modes through iterative multi-turn agentic search; Stage II distills this evidence into atomic, independently verifiable constraints for GRPO-based policy optimization. Because the model under training can serve as its own rubric generator, DR-rubric-8B supports bootstrap rubric generation without frontier-model assistance. We evaluate on 6 benchmarks spanning agentic research and expert reasoning. Experiments show that DR-Rubric achieves strong competitive performance with only 1K – 3K training instances, where GPT-5-generated rubrics particularly benefit breadth coverage on agentic tasks, Gemini-generated rubrics yield the most balanced performance across agentic and expert reasoning tasks, and bootstrap rubrics exhibit a specialization-to-rebalancing evolution achieving the best overall performance at the third iteration. Results demonstrate that reframing rubric construction from static evaluation templates into an evidence-driven research process yields more scalable, fine-grained reward signals for open-ended tasks.
[NLP-149] On the Generalization Gap in Self-Evolving Language Model Reasoning
【速读】: 该论文旨在解决在严格闭环设置下,仅依赖未标注提示集和基础模型时,生成式AI(Generative AI)通过自演化(Self-Evolution, SE)所生成的内部监督信号,究竟能够逼近理想有监督训练(oracle-supervised training)的程度。其核心问题是:在缺乏外部标注或真实标签的情况下,自演化能否有效生成高质量的监督信号以持续提升模型性能?解决方案的关键在于评估四种代表性自演化策略——单轮验证、多轮反馈修正、迭代训练与课程学习——在统一离线框架下的表现,并通过具有确定性解、可控难度等级的“骑士与说谎者”(Knights and Knaves, KK)逻辑推理任务进行系统分析。研究发现,尽管多轮批评-修正机制结合大模型可显著提升性能(如Gemma 12B接近有监督基准),但性能仍存在不可忽视的差距,且在过度投入计算资源后趋于饱和。此外,在真实世界推理基准上的增益亦有限。因此,研究揭示了当前闭环自演化中内部生成监督的局限性,表明其在无外部参考时难以完全替代有监督学习。
链接: https://arxiv.org/abs/2606.01075
作者: Zhenting Qi,Susanna Maria Baby,Stefanie Anna Baby,Kan Yuan,Andrew Tomkins,Tu Vu,Da-Cheng Juan,Cyrus Rashtchian
机构: Google Research(谷歌研究); Harvard University (哈佛大学); Google(谷歌); Virginia Tech (弗吉尼亚理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent work suggests that large language models (LLMs) can improve through self-evolution (SE), using supervision signals generated by the model itself. In this work, we ask: under a strict closed-loop setup, where the self-evolution algorithm has access only to an unlabeled prompt set and a base model, how close can internally generated supervision come to oracle-supervised training? We analyze four representative strategies in a unified offline self-evolution framework: single-round verification, multi-turn revision with feedback, iterative training, and curriculum learning. Our primary experiments use Knights and Knaves (KK) logical reasoning tasks, which provide deterministic solutions, controlled difficulty levels, and a clean testbed for easy-to-hard generalization. We first show that self-evolution consistently improves over the base model, but plateaus after excessive training compute is invested, and eventually still leaves a non-trivial gap to oracle supervision. We find that multi-turn critic-revision with large models can reach strong self-evolution performance, with Gemma 12B nearly matching oracle-supervised training. Beyond Knights and Knaves, we also evaluate self-evolution on real-world reasoning benchmarks, where gains are also modest. Overall, our results characterize when closed-loop self-evolution can help and show how internally generated supervision remains insufficient under this minimal formulation.
[NLP-150] When Is 0.1% Enough? Analyzing the Combined Effects of Dimensionality Reduction and Quantization on Text Embedding Compression
【速读】: 该论文旨在解决高性能文本嵌入模型生成的高维实值向量所带来的存储与计算开销问题。现有压缩方法多采用维度缩减或量化技术,但二者联合使用的效能尚未得到充分研究。本文系统评估了结合维度缩减与量化技术对文本嵌入进行压缩的有效性,基于四个MTEB任务族及四类预训练嵌入模型开展实验。结果表明,联合使用维度缩减与量化可实现远优于单一方法的压缩效果,在部分场景下嵌入向量尺寸可压缩至原始大小的0.1%且几乎不损失性能,同时最优压缩策略具有任务依赖性。其解决方案的关键在于协同优化维度缩减与量化,以实现高效、低损耗的嵌入压缩。
链接: https://arxiv.org/abs/2606.01074
作者: Riku Kisako,Hayato Tsukagoshi,Ryohei Sasano
机构: Nagoya University (名古屋大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent high-performing text embedding models often output high-dimensional real-valued vectors, resulting in substantial storage and computational costs. To address this issue, compression methods based on dimensionality reduction or quantization have been proposed; however, the effects of combining dimensionality reduction and quantization have not been sufficiently investigated. In this paper, we systematically examine the effectiveness of compressing text embeddings by combining dimensionality reduction and quantization, using four MTEB task families and four pretrained embedding models. The experimental results demonstrate that combining dimensionality reduction and quantization enables substantially stronger compression than using either method alone, that in some settings embeddings can be reduced to as little as 0.1% of their original size with almost no performance degradation, and that the optimal compression strategy depends on the task.
[NLP-151] MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models EMNLP2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在偏好对齐(Preference Alignment, PA)过程中内部表征变化不明确的问题。尽管偏好对齐显著提升了模型的行为表现,但其在对抗性攻击(如越狱、提示注入、检索时扰动)下的失效表明,仅依赖行为层面的评估无法全面揭示对齐带来的内在机制演化。为此,论文提出一种以几何结构为核心的分析框架MENTIS,用于量化指令微调(Instruction-Tuned, IT)模型向偏好对齐(PA)模型转变时的内部计算重构。其核心解决方案在于引入三个关键指标:基于层间协方差的扭转范数(T1)、谱级扭转诊断(T2)以及能量-辐射-激活度量(ERA),实现对对齐诱导的内部几何重组的系统性测量。研究发现,对齐引发的变化具有选择性而非均匀分布:规范性概念(normative concepts)相较于事实性概念表现出更大的扭转偏移;扭转程度与上下文熵呈负相关;且峰值效应集中于架构特定的中后期层。该模式在词级、提示级及模型级分析中均一致出现,表明偏好对齐会在内部计算中留下结构化、深度局部化的几何痕迹,这些特征超越了行为层面评估所能揭示的范围。
链接: https://arxiv.org/abs/2606.01060
作者: Partha Pratim Saha,Samarth Raina,Mayur Parvatikar,Amit Dhanda,Vinija Jain,Aman Chadha,Amitava Das
机构: Pragya Lab, BITS Pilani Goa, India; IIIT Delhi, India; Amazon, USA; Google, USA; Google DeepMind, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submitted to EMNLP 2026
Abstract:Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal. Comments: Submitted to EMNLP 2026 Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.01060 [cs.CL] (or arXiv:2606.01060v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.01060 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-152] PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining
【速读】: 该论文旨在解决现有生物医学图文数据集在用于医疗多模态模型持续预训练(CPT)时存在的核心问题:图注(caption)信息不完整、上下文依赖性强且缺乏文章正文支持,导致图文对语义不连贯;同时,大规模自动抽取引入了结构噪声,如缺失图注、残留标记、重复内容及非连贯的多段描述。其解决方案的关键在于提出一种基于上下文的生物医学交错语料库PMC-InterCPT,通过整合与图表相关的正文文本(body text)以增强图注的语义完整性,并构建端到端的数据清洗与重构管道:包括恢复缺失图注、清理文本噪声、重建连贯的图文交错样本,并利用大语言模型(LLM)监督的医学相关性与质量分类器过滤低质量记录。此外,研究揭示了语料库中显著的模态不平衡问题,进而提出四类证据分类体系(four-bucket evidence taxonomy),实现模态感知的重采样策略。实验表明,基于Qwen3.5-4B-Base模型的持续预训练结合有监督微调(SFT),PMC-InterCPT在减少使用原始数据量的同时,显著提升了医疗及通用多模态任务性能,验证了数据质量与模态平衡在医疗多模态持续预训练中的互补作用。
链接: https://arxiv.org/abs/2606.01049
作者: Guanghao Zhu,Zeyu Liu,Zhitian Hou,Pengkai Wang,Zhijie Sang,Minheng Ni,Wenjun Wang,Yanggan Gu,Shuo Cai,Congkai Xie,Jianmin Wu,Hongxia Yang
机构: The Hong Kong Polytechnic University (香港理工大学); Sun Yat-sen University (中山大学); InfiX.ai; PolyU-Daya Bay Technology and Innovation Research Institute
类目: Computation and Language (cs.CL)
备注:
Abstract:Large-scale biomedical image-text datasets extracted from scientific literature provide valuable resources for medical multimodal model training. These datasets are commonly organized as image-caption pairs; however, figure captions are often short, context-dependent, and only partially informative without the surrounding article text. At the same time, large-scale automatic extraction introduces structural noise such as missing captions, residual markup, duplicated context, and incoherent multi-paragraph figure descriptions. We revisit data construction for medical multimodal continued pretraining (CPT) and present PMC-InterCPT, a context-grounded biomedical interleaved corpus that incorporates figure-referencing body text in addition to captions. Our pipeline recovers missing captions, cleans caption and context text, reconstructs coherent interleaved image-text samples, and applies LLM-supervised medical relevance and quality classifiers to filter noisy records. We further reveal strong modality imbalance in the resulting corpus and introduce a four-bucket evidence taxonomy for modality-aware resampling. Through CPT followed by supervised fine-tuning (SFT) on Qwen3.5-4B-Base, PMC-InterCPT effectively improves medical and general multimodal performance while using fewer CPT tokens than the raw source pool. The experimental results also illustrate the complementarity between the data quality and modality for medical multimodal CPT.
[NLP-153] Child-directed speech facilitates production not comprehension in BabyLMs CONLL2026
【速读】: 该论文旨在解决当前对儿童导向语料(Child-Directed Speech, CDS)在婴儿语言模型(BabyLMs)中作用评估的局限性问题,即现有评估主要聚焦于理解能力(comprehension),而忽视了生成能力(production),而后者正是使用基于理论(usage-based theories)语言习得的核心。其解决方案的关键在于提出一种基于建构性“框架”(constructional frames,指频繁出现的词汇模式并带有可填充槽位)的新型生成式评估范式——帧补全任务(frame-completion task),以更真实地反映CDS在促进早期语言使用中的作用。实验结果表明,尽管基于网络爬取数据(FineWeb-edu)训练的模型在最小对辨识等理解任务中表现更优,但仅使用CDS训练的模型在生成层面表现出显著优势:其更早实现语法正确补全,并更集中地将概率分配给合适的槽位填充词。这一发现揭示了传统理解基准可能低估了CDS对婴儿语言模型生成能力的潜在价值。
链接: https://arxiv.org/abs/2606.01045
作者: Bastian Bunzeck,Sina Zarrieß
机构: Bielefeld University (比勒费尔德大学)
类目: Computation and Language (cs.CL)
备注: Accepted at CoNLL 2026
Abstract:Recent studies suggest that child-directed speech is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension and not production, which is central to usage-based theories of language acquisition which argue how CDS facilitates early language use through constructional ‘‘frames’’ (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories in form of a frame-completion task, and compare Llama models trained with CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models’ comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.
[NLP-154] ExpWeaver: LLM Agents Learn from Experience via Latent RAG
【速读】: 该论文旨在解决现有基于经验学习(Experience Learning)方法在大语言模型(LLM)智能体规划与推理中面临的两大核心问题:一是现有方法局限于显式文本空间,依赖语义相似性检索并拼接历史经验,导致严重的令牌(token)开销;二是检索与生成模块分离的架构设计,造成系统耦合性差、效率低下。其解决方案的关键在于提出ExpWeaver框架,通过隐式空间中的检索增强生成(Retrieval-Augmented Generation, RAG)机制,实现端到端可优化的经验学习。ExpWeaver利用大语言模型自身的隐藏状态对经验进行编码,在解码过程中直接于隐空间内检索相关经验,并通过交叉注意力聚合与门控残差机制融合信息,无需独立的RAG模块。整个流程以强化学习(Reinforcement Learning)进行端到端训练,支持生成与排序等多任务。实验表明,ExpWeaver在13项跨领域任务(涵盖问答、推理、编程、科学预测与推荐)中表现卓越,12项任务达到当前最优性能,优于最强基线超过6.8%;在保持与非检索基线相当的令牌效率的同时,显著优于传统文本检索方法(后者需1.5至2倍更多令牌);且在零样本和少样本跨域迁移场景下分别领先16.32%和15.21%,展现出优异的泛化能力。
链接: https://arxiv.org/abs/2606.01041
作者: Tao Feng,Tianyang Luo,Jingjun Xu,Zhigang Hua,Yan Xie,Shuang Yang,Ge Liu,Jiaxuan You
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Experience learning has achieved promising results in enhancing LLM agent planning and reasoning by integrating past interactions as reusable knowledge. However, existing methods remain confined to explicit text space, retrieving experiences via semantic similarity and concatenating them into the context window, leading to substantial token overhead and a decoupled architecture that separates retrieval from generation. To address these limitations, we propose ExpWeaver, a framework that enables LLM agents to learn from experience via latent retrieval-augmented generation, without requiring a separate RAG module. ExpWeaver encodes experiences using the LLM’s own hidden states, retrieves relevant experiences directly in latent space at each decoding step, and integrates them through cross-attention aggregation and gated residual mechanisms. The entire pipeline is optimized end-to-end with reinforcement learning, supporting both generative and ranking tasks. We evaluate ExpWeaver on 13 diverse tasks spanning question answering, reasoning, coding, scientific prediction, and recommendation. Results demonstrate that ExpWeaver achieves state-of-the-art performance on 12 out of 13 tasks, outperforming the strongest baseline by over 6.8%; maintains token efficiency comparable to non-retrieval baselines while text-based retrieval methods require 1.5 to 2 times more tokens; and exhibits superior cross-domain generalization, outperforming the strongest baseline by 16.32% under zero-shot transfer and 15.21% under few-shot transfer. Our code for ExpWeaver is released at this https URL.
[NLP-155] A Finite-Calibration Regime Map for LLM Judge Panels
【速读】: 该论文旨在解决在有限人类标注预算下,大型语言模型(LLM)评判小组应采用低维堆叠器(low-dimensional stackers)还是联合输出表(joint output tables)进行校准的问题。其核心挑战在于:低维堆叠器虽具有较低的估计成本,但无法捕捉评判者之间的交互效应;而联合表校准器虽能建模复杂交互,却因单元格数量和未见模式带来的高计算与数据需求而面临可扩展性瓶颈。论文将这一权衡关系形式化为一个有限校准范式图(finite-calibration regime map),并提出可部署的“有限校准评判小组选择”(Finite-Calibration Panel Selection)方法,该方法通过判别性诊断(包括表格与参数化估计诊断)对评判路径、前缀长度及聚合家族进行选择。在RewardBench、LLMBar、SummEval和Arena100K等数据集上的实验表明,在7名评判者(含DeepSeek V4 Flash)配置下,标量/可靠性聚合在20个真实数据-预算组合中胜出16次,表明当前评判输出往往具有加性或冗余特性。控制实验进一步揭示互补区间:尽管加性标签仍偏好标量聚合,但当存在六元交互时,更大的联合表在未见样本质量消失后,测试均方误差从0.224显著降至0.061。因此,实际问题并非“需要多少评判者”,而是“在现有标注资源下,新增评判者的贡献是否可被有效估计”。
链接: https://arxiv.org/abs/2606.01034
作者: Bin Zhu,Yanghui Rao
机构: Sun Yat-sen University (中山大学)
类目: Computation and Language (cs.CL); Methodology (stat.ME)
备注: Work in Progress
Abstract:We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset–budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not ``how many judges?‘’ but whether the next judge’s information is estimable under the available human labels.
[NLP-156] Revise Dont Freeze: Sampler-Matched Training for Self-Correcting Masked Diffusion Language Models
【速读】: 该论文旨在解决掩码扩散语言模型(MDLMs)在去噪过程中存在的一种关键缺陷:标准采样器在解码时一旦确定某个位置的词元(token)即予以固定,导致模型无法利用其在后续步骤中对已暴露词元进行修正的能力。尽管生成式模型具备逐步修正的能力,但现有方法要么引入启发式或学习机制来重写已提交的词元,要么通过重新掩码回[MASK]实现再预测,均需额外模块或计算开销。为此,本文提出D3IM——一种无需参数的采样器,基于校正风格的逆向更新过程,可直接对已可见词元进行可见到可见的修正,无需辅助模块或额外前向传递。此外,研究发现模型存在“保持偏差”(preservation bias)现象,即倾向于重复自身错误的已提交词元而非纠正。为应对此问题,提出了轻量级后训练方法SCOPE(Self-Conditioned On Prediction Errors),通过模拟D3IM的采样流程,使模型学会识别并修正预测误差。在LLaDA-8B模型上,使用64个去噪步骤时,SCOPE+D3IM相较于原始模型在GSM8K、MATH-500、HumanEval和MBPP上的表现分别提升13.0、4.8、15.3和10.4个百分点,且数学类与HumanEval任务的增益随去噪步数增加而扩大,验证了该方案的有效性与可扩展性。
链接: https://arxiv.org/abs/2606.01026
作者: Longxuan Yu,Shaorong Zhang,Yu Fu,Hui Liu,Yue Dong,Greg Ver Steeg
机构: University of California, Riverside; Microsoft
类目: Computation and Language (cs.CL)
备注: 8 pages, 2 figures, 10 tables
Abstract:Masked diffusion language models (MDLMs) re-predict every position at each denoising step, but standard samplers commit tokens once revealed, leaving this revision capability unused. Existing approaches either add heuristic or learned mechanisms to revise committed tokens, or remask them back to [MASK] before re-predicting; a principled sampler that directly revises visible tokens without auxiliary modules remains underexplored. We introduce D3IM, a parameter-free sampler derived as a corrector-style reverse update that permits direct visible-to-visible revision without additional modules or auxiliary passes. D3IM also reveals a model-side obstacle we term preservation bias: the model tends to reproduce its own wrong committed tokens rather than correct them. We address this with SCOPE (Self-Conditioned On Prediction Errors), a lightweight post-training procedure that simulates D3IM’s sampling process. On LLaDA-8B at 64 denoising steps, SCOPE+D3IM improves over the original LLaDA-8B with standard unmasking by +13.0 on GSM8K (68.3%), +4.8 on MATH-500 (23.6%), +15.3 on HumanEval (29.3%), and +10.4 on MBPP (30.8%), with gains that increase as more denoising steps are used on math and HumanEval.
[NLP-157] DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs
【速读】: 该论文旨在解决离散掩码扩散语言模型(Discrete Masked Diffusion Language Models, DLM)在少步推理(few-step decoding)时面临的长度与质量之间的权衡问题:固定步数预算下,传统方法要么生成短但高质量的文本,要么生成长但重复性高的文本。其解决方案的关键在于通过轻量级微调将预训练的离散掩码DLM(如LLaDA-8B-Instruct)适配为支持连续嵌入空间去噪(continuous denoising)的模型。具体而言,作者采用离散随机定位(Discrete Stochastic Localization, DSL),以每标记的高斯噪声作为软掩码替代原有的二值掩码,在仅1,000步的持续预训练后,使模型能够在嵌入空间中联合演化所有位置,并将硬标记决策推迟至最后一步。该方法实现了连续推理,显著缓解了提前终止与重复性的权衡问题,在低步数预算(≤16次前向传播)下的零样本摘要任务中,于四个基准上均取得了最佳的ROUGE-1得分。此外,该适配还赋予模型对噪声状态的选择性鲁棒性,即能够修正被污染的标记同时保留未受损的干净标记,而使用标准掩码扩散训练的对照实验未能实现上述特性。
链接: https://arxiv.org/abs/2606.01024
作者: Longxuan Yu,Yunshu Wu,Yu Fu,Siheng Xiong,Rob Brekelmans,Hui Liu,Yue Dong,Greg Ver Steeg
机构: University of California, Riverside(加州大学河滨分校); Georgia Institute of Technology(佐治亚理工学院); Microsoft(微软)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, 28 tables
Abstract:Discrete Masked diffusion language models generate text by iterative parallel decoding, but few-step decoding suffers from a tradeoff between length and quality: with a fixed step budget, standard methods can generate a short, high-quality output, or they can produce long but repetitive text. Continuous denoising can sidestep this tradeoff by evolving all positions jointly in embedding space, but building such a model from scratch at scale remains an open problem. We show that a pretrained masked DLM can instead be lightly adapted to support continuous embedding-space denoising. Starting from LLaDA-8B-Instruct, we continue-pretrain for only 1,000 steps with Discrete Stochastic Localization (DSL), replacing binary masking with continuous per-token Gaussian noise as a soft mask. The adapted model supports continuous inference that evolves all positions jointly in embedding space and defers hard token commitment to the final step. On zero-shot summarization at low step budgets (=16 forward passes), DSL-LLaDA-SDE achieves the best ROUGE-1 on all four benchmarks and largely avoids the premature-termination / repetition tradeoff of iterative unmasking. The same adaptation also yields selective noisy-state robustness: the model corrects corrupted tokens while preserving clean ones. Control experiments using standard masked diffusion training with the same compute demonstrate neither behavior.
[NLP-158] Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)生成过程中的高计算成本问题,尤其是在自回归解码中每次生成一个新标记所导致的延迟。现有方案如推测解码(Speculative Decoding)通过预先草拟多个标记并一次性验证以提升效率,但其加速效果依赖于草稿标记被接受的数量。针对无参数草稿源在结构化和代理型工作负载中虽可低成本生成长序列,但其收益随生成步骤变化、难以预测的问题,本文提出混合验证解码(Hybrid Verified Decoding)方法。其核心创新在于:在验证前预测缓存草稿的接受长度,并基于该收益估计动态选择使用缓存验证或基于模型的草稿器。实验表明,在三个LLM和十六个数据集上,该方法在代理型工作流中表现尤为突出,相较EAGLE3实现了平均2.73倍的加速。分析揭示了提示结构如何创造缓存机会,高收益草稿集中于草稿空间的极小部分,且基于收益的筛选机制显著减少了序列解码的工作量,表明运行时草稿选择是推测解码未来发展的重要方向。
链接: https://arxiv.org/abs/2606.01019
作者: Xin Su,Dawid Majchrowski,Fangyuan Yu,Vanshil Atul Shah,Sebastian Rogawski,Pawel Morkisz,Anahita Bhiwandiwalla,Phillip Howard
机构: Thoughtworks(思特沃克); Nvidia(英伟达)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM) generation remains expensive because autoregressive decoding calls the model once for each new token. Speculative decoding reduces this cost by drafting multiple tokens and verifying them with the target model in one step, but its speedup depends on how many drafted tokens are accepted. Parameter-free draft sources can propose long continuations at low cost in structured and agentic workloads, yet a cache match that looks promising at one generation step may have low payoff at the next. We propose Hybrid Verified Decoding, which predicts the accepted length of a cache draft before verification and uses this payoff estimate to choose between cache verification and a model-based drafter. Across three LLMs and sixteen datasets, Hybrid Verified Decoding is especially effective on agentic workflows, where it outperforms EAGLE3 in every setting with a 2.73x average speedup. Our analysis shows how prompt structure creates cache opportunities, how high-payoff cache drafts concentrate in a small part of the draft space, and how payoff-guided selection reduces sequential decoding work, pointing to runtime draft selection as a promising direction for speculative decoding.
[NLP-159] PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100 Languages and Dialects KDD2026
【速读】: 该论文旨在解决当前端到端(End-to-End, E2E)语音-大语言模型(Speech-Large Language Models, Speech-LLMs)评估体系滞后于技术发展的问题,特别是现有基准测试在语言资源分布、任务层级和方言覆盖方面的显著局限性。具体而言,现有评测存在三大缺陷:对高资源语言的明显偏倚、过度关注低层次语音识别(ASR)而忽视语义推理能力,以及对区域方言的忽略。为克服这些不足,研究提出PolySpeech-100——一个大规模多语言语音理解基准,涵盖110种语言变体,包括19种中国方言和80余种低资源语言,以评估模型在“母语级”语音理解上的表现。其解决方案的关键在于采用一种创新的混合构建流程,将高质量人工录音与指令驱动的合成语音相结合,有效扩展了数据覆盖范围并提升了多样性。通过在22个先进模型(如Gemini-3、GPT-Audio、Qwen2.5-Omni)上的系统评估,研究揭示了若干关键发现:首先,开源E2E模型在复杂方言上优于级联式(ASR+LLM)系统,证明直接音频处理能更好保留韵律特征(如语调、重音)等关键副语言线索;其次,商业模型在低资源语言上表现稳健,而开源模型则出现灾难性性能下降;最后,出人意料地发现,在标准零样本设置下,链式思维(Chain-of-Thought)提示常导致多数模型的语音理解性能下降,暴露出当前架构在模态对齐方面存在的深层缺陷。PolySpeech-100为下一代包容性强、全模态能力的Speech-LLMs设立了严谨的评估标准,相关数据、演示与代码已公开共享。
链接: https://arxiv.org/abs/2606.01016
作者: Sicheng Yang,Shulan Ruan,Shiwei Wu,Yu Liu,Lu Fan,Zhi Li,You He
机构: Tsinghua University; JD AI Research; Shenzhen International Graduate School, Tsinghua University
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 19 pages, 13 figures, KDD 2026
Abstract:While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess `native-level’ speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at this https URL.
[NLP-160] rust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher ICML2026
【速读】: 该论文旨在解决在缺乏可靠标签的情况下,如何利用弱教师(weaker teacher)的监督信号来提升强学生模型(strong student)性能的问题,即弱教师到强学生(weak-to-strong)泛化问题。其核心挑战在于识别哪些来自弱教师的弱标签(weak labels)具有足够的可靠性以作为有效的训练信号。为此,论文提出引入“信任函数”(trust functions),为每个弱标签分配一个标量信任得分,并基于该得分对弱监督信号进行过滤。这一方法在世界知识、定量推理和策略游戏等多个领域均表现出色,所训练的学生模型性能达到甚至超越真实标签(ground-truth supervision)的水平,实现了近乎无损的弱到强泛化。此外,信任函数支持构建迭代式弱到强链,通过将学生模型反复重用为下一阶段的教师,持续放大性能增益。信任函数的优势可归因于多个机制,包括动态筛选高置信度标签、减少噪声干扰以及促进知识的渐进式累积。
链接: https://arxiv.org/abs/2606.01000
作者: Arda Uzunoglu,Alvin Zhang,Daniel Khashabi
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: ICML 2026
Abstract:Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher, amplifying the gains. There are several mechanisms to which advantage of trust functions can be attributed.
[NLP-161] Decoding in Order-Agnostic Language Models: Chain-Rule Deviation and Uniform Spreading
【速读】: 该论文旨在解决生成式语言模型在无序条件(order-agnostic)推理下,因揭示顺序(reveal order)变化导致的似然值波动与生成质量不一致的问题。具体而言,现有顺序无关语言模型(Order-agnostic Language Models, OALMs),如离散扩散语言模型(dLLMs),虽可在任意揭示顺序下进行序列生成或评分,但其学习到的条件分布并非精确的联合分布因子化,揭示顺序的变化会显著影响目标对数似然(最大达0.49 nats/token),导致似然值同时混杂了内容难度与路径依赖性伪影。为应对这一问题,论文提出关键解决方案:引入基于置信度轨迹(confidence trace)形状的互补诊断方法。核心思想是基于“均匀扩散定理”——在总似然固定条件下,每步置信度均匀分布时目标恢复能力最强;由此引出以Var(logqt)(置信度对数方差)作为衡量解码路径结构性的指标。实验表明,在C4及四个下游任务上,低置信度方差可有效区分结构化路径与随机顺序,且与下游任务正确率高度相关。因此,论文主张在比较OALM解码路径时,应联合报告平均置信度与置信度方差,以更准确地评估路径质量。
链接: https://arxiv.org/abs/2606.00997
作者: Lin Yao
机构: Shanghai Jiao Tong University (上海交通大学); Zhongguancun Academy (中关村学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Order-agnostic language models (OALMs), including discrete diffusion language models (dLLMs), are trained to predict masked tokens under arbitrary conditioning sets, allowing sequences to be generated or scored under arbitrary reveal orders at inference time. In LLaDA-2.1, we report three findings. First, the learned conditionals are not exact factorizations of a coherent joint distribution: changing only the reveal order shifts target log-likelihood by up to 0.49 nats/token, so likelihood alone mixes content difficulty with path-dependent artifacts. Second, although confidence-first (CF) decoding is order-agnostic, its reveal orders are close to left-to-right (L2R) on content tokens. Third, we propose a complementary diagnostic based on the shape of the confidence trace. A uniform-spreading theorem shows that, at fixed total likelihood, target recoverability is maximized when per-step confidence is spread uniformly; the resulting deviation motivates \mathrmVar(\log q_t) as a diagnostic for comparing decoding paths. Across C4 and four downstream benchmarks, low variance separates structured paths from random ordering, and variance is consistently associated with downstream correctness. These results support reporting mean confidence and confidence variance jointly when comparing OALM decoding paths.
[NLP-162] A Registry-Bound LLM Pipeline for Evidence-Grounded Trait Extraction across Tropical Plants Aquatic Species and Exotic Pets
【速读】: 该论文旨在解决大规模、自动化从文献中提取可验证的物种性状数据(trait records)的挑战,尤其针对栽培热带植物、水生生物及宠物物种。其核心问题是:如何在保证数据规模的同时,确保生成的性状记录具备可审计性与证据依据,避免生成式模型常见的“幻觉”问题。解决方案的关键在于提出一个四机制协同的框架:(1)基于版本化的39个关键字段封闭词汇性状注册表(closed-vocabulary trait registry),对所有录入值施加类型约束;(2)每条记录附带原文引用(verbatim evidence quote),确保每个数值可追溯至原始文本片段;(3)为每条记录分配置信度标签(高或中等,低置信度在持久化前被丢弃);(4)支持多版本存档,实现历史版本可追溯。该框架使系统在处理超过40万种物种、548万条性状记录时,仍能保持高度可验证性,经三重验证层评估,证据支持率高达90%以上,且在抽样测试中表现完全一致,显著提升了生成式语言模型在科学知识库构建中的可信度与实用性。
链接: https://arxiv.org/abs/2606.00994
作者: Jeff Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 33 pages, 6 figures; methodology paper
Abstract:We describe a registry-bound large-language-model extraction pipeline producing evidence-grounded structured trait records at scale, on cultivated tropical plant, aquatic, and pet species. Four mechanisms render LLM-derived rows auditable: a versioned 39-key closed-vocabulary trait registry constraining every admitted value to a typed schema; a per-row verbatim evidence quote tying each value to source text; a per-row confidence label (high or medium; low dropped pre-persist); and multi-version preservation. Applied to 409,880 publishable species from the Tropical Species Encyclopedia, the pipeline executed 706,220 runs and persisted 5,489,881 trait records across 409,820 species (99.985%), 81.57% at high confidence. We report three validation layers in descending evidentiary strength: at full population, 90.12% of 5,427,588 evidence-bearing rows have their quote as a verbatim source substring (93.49% excluding one compliance meta-trait); a quote-supports-value audit on n=100 stratified non-red-zone rows yielded 100/100 (lower bound 96.30%); face-validity on n=50 red-zone rows yielded 50/50 Accept (lower bound 92.86%). Per-record correctness is not claimed; 100% pending human curation. The contribution is the four-mechanism framework.
[NLP-163] Robust Asynchronous Planning via Auto-Formalization
【速读】: 该论文旨在解决大语言模型(LLM)在处理现实世界中异步、具有非均匀持续时间、并发性及执行时间约束的复杂规划任务时,现有基准测试体系缺乏全面覆盖的问题。其核心挑战在于如何设计可扩展且鲁棒的规划框架以应对动态环境中的约束变化。解决方案的关键在于采用基于约束满足问题(Constraint Satisfaction Problem, CSP)的通用形式化表示方法(如CP-SAT Formalizer),相较于依赖谓词逻辑的PDDL2.1形式化方法,其在大规模依赖图(从5到100个动作)下展现出显著更优的可扩展性与稳定性。实验表明,当依赖关系复杂度增加时,传统规划器(Planner)与PDDL2.1 Formalizer的准确率急剧下降,而CP-SAT Formalizer仍能保持高达83%的规划准确率;此外,通过引入仅更新事件触发约束的状态感知修复策略,可在执行时间约束动态变化的情况下将性能恢复至84.5%,验证了通用约束建模在提升规划系统鲁棒性方面的关键优势。
链接: https://arxiv.org/abs/2606.00981
作者: Jiayi Zhang,Jianing Yin,Ben Zhou,Li Zhang
机构: Drexel University (德雷塞尔大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:LLMs can plan by either generating action sequences directly as a Planner or translating tasks into domain specific language for an external solver as a Formalizer. While most real-world tasks are asynchronous with non-uniform durations, concurrency, and execution-time constraints, existing benchmarks hardly cover them. We unify these asynchronous planning challenges under a single formulation and introduce the first three benchmarks that address each at scale. We conclude that the choice of formal representation primarily determines whether planning scales: as dependency graphs grow from 5 to 100 actions, Planner collapses from 96% to 5% plan accuracy and PDDL2.1 Formalizer from 13% to 0%, while CP-SAT Formalizer averages 94% and still achieves 83% at 100 actions. Faithfulness diagnostics show that PDDL2.1’s predicate-based planning representation becomes brittle compared to general constraint satisfaction programs, when LLMs must keep predicates, effects, and goals consistent. Execution-time updates of planning constraints further degrade performance sharply (Planner 23.9%, PDDL2.1 0.7%, CP-SAT 46.1%), but a state-aware repair strategy that updates only event-induced constraints recovers CP-SAT Formalizer to 84.5%.
[NLP-164] Lost in Delusion: Examining LLM Safety Under User Delusions and Distress
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)聊天机器人在应对伴有妄想信念的心理危机情境时的安全性问题,尤其关注当个体的痛苦情绪与妄想内容交织于持续多轮对话中时,模型的表现缺陷。现有研究多聚焦于单轮危机检测或通用治疗质量评估,未能揭示模型在长期、复杂心理互动中的行为模式。本文通过构建基于临床真实人格设定的配对多轮模拟对话,将每一段包含妄想框架的对话与仅含情绪困扰的对照组进行对比,以分离妄想语境的影响。研究发现存在“识别-干预鸿沟”:尽管模型在识别心理痛苦方面表现稳定,无论是否处于妄想语境下,但一旦痛苦被嵌入妄想框架,其安全干预行为显著减弱,干预抑制最高可达4.5倍。这种失败并非源于情感共情不足,而是由于模型逐步接受用户前提所导致的累积性认知同化。进一步实验表明,简单提示模型评估用户状态反而加剧风险;唯有采用具备妄想意识的显式响应引导提示,并辅以可靠的妄想分类器,才能有效缩小该差距——然而当前最脆弱的模型自身所依赖的分类器性能亦不可靠。因此,论文提出:在实际部署中,必须将妄想框架视为独立的风险信号,优先于对话适应性处理,以确保安全干预的有效性。
链接: https://arxiv.org/abs/2606.00975
作者: Andrew Aquilina,Chetna Nihalani,Vasudha Varadarajan,Nathan S. Fishbein,Yu-Ru Lin,Maarten Sap
机构: University of Pittsburgh(匹兹堡大学); Carnegie Mellon University(卡内基梅隆大学); Fordham University(福特汉姆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:LLM chatbots increasingly serve as a first source of support for people in psychological distress, including those whose distress is entangled with delusional beliefs. Prior work on LLM mental-health safety largely evaluates general therapeutic quality or single-turn crisis detection, leaving unclear how models behave when distress is intertwined with delusion over sustained conversations. We address this gap with matched multi-turn simulations, across clinically grounded personas and six LLMs, that pair each delusional conversation with a distress-only control to isolate the effect of delusional framing. This reveals a recognition-intervention gap: models detect distress at comparable rates regardless of framing, yet sharply fail to act on it once distress is embedded in delusion, with safety interventions suppressed by up to 4.5x. The failure tracks accumulated acceptance of the user’s premises rather than emotional validation. Worse, the intuitive fix of prompting models to assess user distress backfires under delusional framing; only delusion-aware prompting with explicit response guidance closes the gap, and even this depends on a delusion classifier that is itself unreliable on the most vulnerable models. Safe deployment therefore requires treating delusional framing as a distinct risk signal that overrides conversational accommodation.
[NLP-165] HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering
【速读】: 该论文旨在解决当前基于大语言模型的生物医学问答系统在评估中过度依赖答案准确率,而忽视了模型输出可解析性、遵循结构化可靠性指令能力、弱答案空间识别以及避免自信错误承诺等关键可靠性问题。其核心解决方案是提出HypothesisMed——一种面向生物医学多选题问答的推理时可靠性评估框架,通过结合直接提示(direct prompting)、思维链(chain-of-thought prompting)、HypothesisMed-v3提示策略与答案融合机制,在推理阶段实现对答案空间的结构化标注与置信度评估。其中,HypothesisMed-v3引入SPACE标签体系(VALID、INCOMPLETE、CONTRADICTED),用于表征答案空间的状态,并通过融合策略选择最终答案,从而提升模型在可解析性、空间覆盖度及可靠性报告方面的表现。实验结果表明,尽管答案准确率并非始终最优,但该框架显著提升了输出的可解析性与结构化可靠性信息覆盖率,且大幅降低虚假承诺行为。研究进一步揭示:答案准确率、可解析性、结构化可靠性报告、校准能力与错误承诺行为是可分离的独立能力,因此该工作贡献的核心并非追求通用性能上限,而是提供一个可复现的、具备审计能力的推理时评估框架,使生物医学问答模型能在结构化可靠性约束下作为可信赖的工作流组件进行部署与评估。
链接: https://arxiv.org/abs/2606.00971
作者: Md Motaleb Hossen Manik,Ge Wang
机构: Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct, chain-of-thought, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model’s best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.
[NLP-166] Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning
【速读】: 该论文旨在解决视觉-语言模型(Vision-Language Models, VLMs)在需要精确空间理解的任务中表现不可靠的问题,尤其体现在视角推理、方向比较和距离估计等场景中。由于多视角图像与单目视频中的空间线索往往稀疏且分散于冗余观测之中,难以有效组织与利用。现有基于重建的视觉基础模型(Vision Foundation Models, VFMs)虽能将观测信息聚合为显式的三维空间记忆(如点云),但直接以自由形式调用这些重建模型存在脆弱性:VLM可能错误触发工具、跳过必要的空间变换或误用中间结果。为此,本文提出Reasmory框架,其核心创新在于将空间推理建模为对重构空间记忆的结构化程序执行。Reasmory构建显式的三维记忆,并引入语义锚定的三维物体实例,同时设计轻量级领域特定语言(Domain-Specific Language, DSL),严格约束VLM在查询对象与相机、转换视角及渲染观测时的操作方式。生成的程序在执行前经过解析与验证,显著提升了与空间记忆交互的可靠性。在多视角图像与视频空间推理基准上的实验表明,相比GPT-5-mini和Gemini-3-flash等强基线模型,Reasmory实现了6%至18%的一致性能提升,证明了在受控、验证操作下使用显式三维记忆相较于自由形式工具调用更具优势。
链接: https://arxiv.org/abs/2606.00963
作者: Jixuan He,Xueting Li,Chieh Hubert Lin,Ming-Hsuan Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Vision-Language Models (VLMs) exhibit emerging spatial reasoning capabilities, yet they remain unreliable on tasks requiring precise spatial understanding, such as viewpoint reasoning, directional comparison, and distance estimation. In multi-view images and monocular videos, relevant spatial cues are often sparse and distributed across redundant observations, making them difficult to organize and exploit. Reconstruction-based Vision Foundation Models (VFMs) offer a natural way to aggregate such observations into explicit spatial memory, such as point clouds. However, simply exposing reconstruction models as free-form tools is brittle, VLMs may invoke tools incorrectly, skip required spatial transformations, or misuse intermediate results. We propose \textbfReasmory, a framework that formulates spatial reasoning as structured program execution over reconstructed spatial memory. Reasmory constructs explicit 3D memory, augments it with semantically grounded 3D object instances, and introduces a lightweight Domain-Specific Language (DSL) that constrains how VLMs query objects and cameras, transform viewpoints, and render observations during reasoning. Generated programs are parsed and validated before execution, enabling more reliable interaction with spatial memory than unconstrained tool use. Experiments on multi-view image and video spatial reasoning benchmarks show consistent gains of 6–18% over strong baselines, including GPT-5-mini and Gemini-3-flash, indicating that explicit 3D memory is most useful when accessed through constrained, validated operations rather than free-form tool calls.
[NLP-167] Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink
【速读】: 该论文旨在解决生成式模型中机制可解释性(mechanistic interpretability)的核心假设失效问题,即:在识别到某一表征签名(representational signature)的探测器(probe)是否等同于识别出执行相应计算的神经电路这一假设。研究发现,在Mamba-2架构中,这一假设会系统性地失效。其关键发现在于,状态汇点(state sink)——即边界标记上Δ门(Delta-gate)过度激活的现象——可分解为两个功能独立的头集(head sets)。单桶探测器(single-bucket probe)仅能恢复出较小的执行层(execution layer),而遗漏了具有相同表征签名但更大规模的检测层(detection layer)。具体而言,约5%的BOS特化头(BOS-specialist heads)在不同模型规模与语料上均对起始符号上下文和换行符目标预测具有因果支持作用;而通过多类聚合恢复的双头结构(dual heads,占27–35%)虽表现出更强的表征相似性,但在消融实验中因果效应显著减弱。这表明表征相似性并不等同于功能等价性。该差异对下游行为至关重要:移除BOS特化头会导致RULER NIAH检索准确率从1.00骤降至0.00(在1024上下文长度下),而其大小匹配的互补头则维持基线性能。随机通道分桶对照实验排除了硬件粒度因素的影响,指向Mamba-2中共享Δ投影(head-shared Delta projection)是导致此现象的根本原因。因此,该研究提出:仅凭探测结果无法直接确定执行电路,必须通过类别条件消融(class-conditional ablation)而非仅依赖余弦相似性(class-conditional cosine)来区分执行与检测电路。
链接: https://arxiv.org/abs/2606.00930
作者: Yuhang Jiang
机构: Independent Researcher
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 3 figures
Abstract:Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialist heads (about 5% of heads at 2.7B) causally support both BOS-context and newline-target predictions across model scales and corpora. Dual heads (27-35% of heads, recovered by multi-class aggregation of the same probe) show stronger BOS-newline representational similarity but substantially weaker causal effects under ablation. Representational similarity does not imply functional equivalence. This distinction matters for downstream behaviour: ablating BOS-specialist heads collapses RULER NIAH retrieval accuracy from 1.00 to 0.00 at 1024 context length in both Mamba-1 2.8B and Mamba-2 2.7B, while size-matched complements preserve baseline performance. A random channel-bucketing control rules out substrate granularity alone, implicating Mamba-2’s head-shared Delta projection. Probe-derived specialty can identify execution circuits; at coarse granularity the same probe also recovers detection circuits, and separating them requires class-conditional ablation rather than class-conditional cosine. Comments: 16 pages, 3 figures Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.00930 [cs.CL] (or arXiv:2606.00930v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.00930 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-168] ask Structure Reverses Layerwise State Encoding in Sequence Models
【速读】: 该论文旨在解决生成式模型中层间状态编码(layerwise state encodings)的机制性问题,特别是揭示不同架构(如Transformer、Mamba、LSTM、GRU等)在处理特定任务时,其状态信息的可读性分布如何随任务性质变化而动态调整。传统观点认为,递归模型倾向于集中可读状态于早期层,而基于注意力的模型则将其分散于各层,但本研究发现这一模式在任务改变时会反转。关键发现在于:在奇偶校验(Parity)任务中,Mamba与递归基线模型的状态可读性集中在后期层,而Transformer则逐步构建该信息;而在有界深度的Dyck-k语言任务中,这一模式发生翻转。为区分文献中混淆的两种解释——代数结构(交换性)与计算结构(前缀更新 vs. 栈操作),作者引入非交换对称群S₃的置换组合任务,结果表明状态分布模式与计算结构一致而非交换性。进一步的因果干预分析显示,在浅层形式模型中,线性可读方向具有功能性必要性,并在分布外长度上仍保持重要性;但在预训练规模下,模型行为分化:微调后的Pythia在中层存在显著瓶颈(如160M模型中第6-7层消融导致准确率下降约81%),而预训练的Mamba虽最终层高度可读但无单一探针方向能破坏任务性能,然而中间位置激活修补即可恢复97%-98%的干净-损坏对数差距。这表明,探针定位的是线性可获取状态的位置,而非计算瓶颈所在。因此,该研究的核心结论是:机制性特征并非仅由架构决定,而是架构与任务共同作用的结果。
链接: https://arxiv.org/abs/2606.00926
作者: Yuhang Jiang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 20 pages, 11 figures, 8 tables
Abstract:Mechanistic studies of sequence models often treat layerwise state encodings as architectural traits: recurrent models concentrate readable state, attention-based models distribute it. We find that the same architecture reverses this profile when the task changes. Across Transformers, Mamba, Mamba-2, LSTMs, and GRUs, Parity is concentrated late in Mamba and the recurrent baselines and built gradually by Transformer; on bounded-depth Dyck-k the pattern flips. The same flip appears in fine-tuned Mamba-130M and Pythia-160M, and the Pythia Dyck bottleneck persists at 410M. Two explanations are conflated in the literature: algebraic structure (commutativity) versus computational structure (prefix update vs. stack). To separate them we add a third task: non-commutative S_3 permutation composition. S_3 groups with Parity, not Dyck, on layerwise probing across all five architectures and on Mamba-specific Conv1D attribution, so the grouping tracks computational structure rather than commutativity. Causal interventions show that, in the 4-layer formal models, linearly readable directions are often functionally necessary and can remain important at out-of-distribution lengths on Parity and Dyck. At pretrained scale the picture splits. Fine-tuned Pythia Dyck has a strong middle-layer bottleneck (L6-L7 ablation drops accuracy by roughly 81% at 160M; broader L4-L18 plateau at 410M), far weaker at the best-probe layer. Pretrained Mamba shows the complementary failure mode: its final layer is highly readable, no single probe direction breaks the task on Parity, Dyck, or S_3, yet mid-position activation patching there recovers about 97-98% of the clean-corrupted logit gap. Probing localizes where state is linearly available, not always where the computation is bottlenecked. Mechanistic signatures are properties of architecture and task together. Comments: 20 pages, 11 figures, 8 tables Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL) Cite as: arXiv:2606.00926 [cs.LG] (or arXiv:2606.00926v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00926 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-169] owards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成式问答(Generative Question-Answering, QA)任务中普遍存在幻觉(hallucination)问题,即模型生成看似合理但事实错误的内容,尤其在高风险领域会严重影响可信度与安全性。针对这一挑战,论文提出一种参数高效的方法——负责任对比软提示(Responsible Contrastive Soft Prompting, RCSP),其核心在于通过可学习的软提示(soft prompts)实现对生成内容的可控调节。关键创新在于设计了一种复合损失函数,联合优化三个目标:抑制幻觉内容、在不确定性情境下促进模型主动拒绝回答(responsible abstention),以及保持或提升事实召回率。为此,方法融合了对比学习(contrastive loss)、课程学习(curriculum learning)和KL散度正则化(KL regularization)机制,有效引导软提示在训练过程中学习到更可靠的行为模式。实验在五个多样化生成式QA数据集上基于LLM-as-a-Judge框架进行评估,结果表明,使用Gemma 3 (12B)和Llama 3.1 (8B)作为骨干模型时,RCSP在事实准确率与幻觉抑制之间实现了良好平衡,显著优于标准推理与指令提示基线,且仅需微调极小比例的参数,展现出良好的计算效率与模块化部署潜力。
链接: https://arxiv.org/abs/2606.00919
作者: S M Tahmid Siddiqui,Akib Jawad Ononto,Anoop Singhal,Latifur Khan
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 20 pages, 5 tables, 2 figures. Accepted for publication in DBSec 2026. The final publication will be available at Springer
Abstract:Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability. Comments: 20 pages, 5 tables, 2 figures. Accepted for publication in DBSec 2026. The final publication will be available at Springer Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2606.00919 [cs.CL] (or arXiv:2606.00919v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.00919 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-170] Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults
【速读】: 该论文旨在解决当前大语言模型(LLM)代理在实际应用中面临的一个关键安全评估盲区:现有评估体系通常仅测试模型本身或用户提示(prompt),而忽视了上游信息排序系统(如推荐算法、检索结果排序器等)对代理决策的潜在影响。其核心问题是,代理在作出最终决策前所接触的信息流(即“滚动”阶段的帖子组合与顺序)可能显著操控其行为,但这一因果效应长期未被系统性研究。论文提出的解决方案之关键在于设计一种受控实验协议,固定模型、角色设定、话题和最终决策提示,仅改变代理在前置十轮“滚动”阶段所遭遇的信息流组成与排序,从而精确隔离并量化信息流编排对下游决策的因果影响。通过在四个来自不同实验室的先进开源指令微调大模型上进行2,785次决策推演,研究发现存在三种响应模式:对抗性屈服、默认饱和,以及一种“方向不对称性”——单向信息流可将原本处于不确定状态的决策从5%导向100%(最显著案例中Fisher p值低至3×10⁻¹⁰),但无法改变已坚定支持的决策。该效应呈现剂量-反应关系,经生成器替换验证排除写作风格干扰,跨多个决策领域(包括移除部署审批门禁、放宽访问控制等安全敏感场景)具有泛化能力,并可通过简单的反馈层防御手段部分缓解。研究进一步指出,推荐系统本质上是大模型代理的一个可操作、默认有限的控制界面,因此必须将代理评估的审计重点从单一最终提示扩展至信息流层,以实现更真实、更安全的评估体系。
链接: https://arxiv.org/abs/2606.00914
作者: Rana Muhammad Usman
机构: Independent Researcher(独立研究员)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 14 pages, 5 figures. Code, post pools, and 2,785 decision rollouts: this https URL
Abstract:LLM agents increasingly act after consuming ranked external information streams such as social feeds, search results, retrieval contexts, and email queues, yet safety evaluations almost always test the model or the user prompt in isolation, never the upstream ranker that decides what the agent reads just before it acts. We introduce a controlled protocol that holds the model, persona, topic, and final decision prompt fixed and varies only the composition and ordering of the posts an agent encounters during a preceding ten-turn “scrolling” phase, isolating the causal effect of feed curation on a downstream decision. Across 2,785 decision rollouts on four modern open instruct LLMs from three independent labs, we identify three response regimes: adversarial capitulation, default saturation, and a default-direction asymmetry in which a one-sided feed tips a decision the model was genuinely uncertain about (in the clearest cases from 5% to 100%; Fisher p as low as 3 x 10^-10) but cannot dislodge one it already favors or holds firmly. The effect follows a dose-response curve, survives a generator swap that rules out a writing-style artifact, generalizes across several decision domains including security-relevant choices such as removing a deployment approval gate or relaxing access controls, and is partly mitigated by two simple feed-level defenses; a frontier model retains its default. We characterize the recommender as a practical, default-bounded control surface for LLM agents, and argue that agent evaluations must audit the feed layer rather than the final prompt alone.
[NLP-171] MLLM -Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)内部表示机制不透明的问题,特别是对跨模态融合过程中令牌嵌入(token embeddings)的线性性、内在维度和各向异性等关键特性缺乏系统性分析的现状。其解决方案的关键在于提出一种名为MLLM-Microscope的新分析框架,通过在ScienceQA数据集上对两类先进MLLMs(LLaVA-NeXT与OmniFusion)进行深入剖析,量化评估不同变换器层中多模态令牌嵌入的几何属性。研究发现,尽管两种模型在主路径与残差路径中的令牌均表现出高度线性特征,但其图像令牌的线性保持能力及内在维度演化存在显著差异:OmniFusion的图像令牌在各层间维持更高的内在维度且各向异性始终较低,而LLaVA-NeXT则呈现轻微线性下降趋势。这一结果表明,多模态融合策略对模型内部表示结构具有决定性影响,揭示了模型设计中跨模态对齐机制的重要性。MLLM-Microscope所提供的可解释性洞察为未来优化模型架构与训练策略提供了重要依据。
链接: https://arxiv.org/abs/2606.00909
作者: Ravil Mussabayev,Rustam Mussabayev
机构: Satbayev University (萨特巴耶夫大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This work presents MLLM-Microscope, a novel system designed for analyzing the hidden representations within Multimodal Large Language Models (MLLMs). Our system evaluates the linearity, intrinsic dimension, and anisotropy of multimodal token embeddings across transformer layers. Utilizing the ScienceQA dataset, we evaluate two state-of-the-art MLLMs, LLaVA-NeXT and OmniFusion. We find that both the main and residual streams for tokens of both modalities exhibit highly linear behaviors across transformer layers. However, LLaVA-NeXT’s image tokens reveal a slight decline in linearity, whereas OmniFusion’s remain consistent. Image token dimensions in OmniFusion remain consistently higher across layers compared to LLaVA-NeXT. Also, the OmniFusion’s anisotropy is observed to stay consistently low throughout the layers. These findings suggest that the inner workings of MLLMs highly depend on the nature of modality fusion performed before passing the token sequence into LLM. This and other new potential insights obtainable from our system are surely capable of enhancing our understanding of the inner workings of MLLMs, informing future model design and optimization.
[NLP-172] Citation Grounding: Detecting and Reducing LLM Citation Hallucinations via Legal Citation Graphs
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在法律文本生成中系统性地产生虚假引用(legal hallucinations)的问题,包括虚构法条、引用已被废止的条款以及混淆管辖权等现象,而现有方法缺乏可扩展的自动化评估与缓解机制。其核心解决方案是提出引用锚定(Citation Grounding, CG),一种基于真实司法数据构建的多维度评估指标,通过三个子指标——引用精确性(验证引用条文是否存在)、引用相关性(是否在语境中恰当)和引用时效性(在特定时间点是否有效)——实现对幻觉类型的精细化诊断。为减少幻觉且无需人工标注,研究进一步提出引用锚定直接偏好优化(CG-DPO),通过四种针对性策略对真实判决中的已验证引用进行算法化扰动,生成偏好对以训练模型区分正确与错误引用。实验表明,在2,244个乌克兰法院判决数据集上,经LoRA微调的Qwen2.5-7B-Instruct模型在引用验证任务中达到98.5%的平均准确率,显著提升了引用可靠性。研究同时开源了引用图谱、评估框架及CG-DPO数据集,为后续法律生成模型的可信性提升提供了关键基础设施。
链接: https://arxiv.org/abs/2606.00898
作者: Volodymyr Ovcharov
机构: LEX AI LLC(LEX AI LLC); Anthropic(Anthropic); Mistral AI(Мистрал ИИ); Amazon Web Services(亚马逊网络服务)
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL)
备注: 14 pages, 3 figures, 3 tables. Code and data: this https URL
Abstract:Large language models systematically hallucinate legal citations – fabricating statute references, citing repealed provisions, and confusing jurisdictions – yet no automated method exists to measure or reduce this behavior at scale. We propose citation grounding (CG), a metric that verifies LLM-generated legal citations against a ground-truth citation graph extracted from 100.8 million Ukrainian court decisions (502 million edges, 21,736 unique statute nodes). CG decomposes into three components – citation precision (does the cited provision exist?), citation relevance (is it contextually appropriate?), and citation temporality (was it valid at the relevant date?) – enabling differential diagnosis of hallucination types. Empirical evaluation on 100 Ukrainian legal queries across five systems – four commercial LLMs via AWS Bedrock (Claude Haiku 4.5, Mistral Pixtral Large, Amazon Nova Pro/Lite) and one RAG-augmented production system – reveals CG ranging from 0.791 to 0.873, with 13-21% of citations hallucinated. To reduce hallucinations without human annotation, we introduce Citation Grounding DPO (CG-DPO): a method that constructs preference pairs algorithmically by corrupting verified citations from real court decisions via four targeted strategies. On a dataset of 2,244 court decisions, a Qwen2.5-7B-Instruct model fine-tuned with LoRA achieves 98.5% mean validation accuracy in distinguishing correct from corrupted citations (rewards margin +14.9, std 0.3 pp across 3 seeds). The citation graph, evaluation framework, and CG-DPO dataset are released as open resources.
[NLP-173] Chunking Methods on Retrieval-Augmented Generation - Effectiveness Evaluation Against Computational Cost and Limitations
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)系统中文本分块(chunking)策略有效性评估缺乏系统性与普适性的问题。当前主流的固定大小分块与语义分块方法虽被广泛采用,但近年来涌现的众多新型分块方法多针对特定场景或数据类型,且缺乏在多样化应用场景下的横向比较与实证支持,导致难以客观评估其性能优势。论文的关键贡献在于首次系统性地评估了多种分块方法在RAG中的表现,并揭示了分块过程并非简单的预处理步骤,而是引入了一系列深刻且常被忽视的影响因素,如信息碎片化、上下文断裂及检索相关性损失等。其解决方案的核心在于通过构建统一的评估框架,识别并量化不同分块策略对检索与生成质量的影响,从而为RAG系统中分块策略的选择提供可复现、可对比的科学依据。
链接: https://arxiv.org/abs/2606.00881
作者: Mateusz Śmigielski(1),Michał Rajkowski(1),Mateusz Zbrocki(1),Michał Bernacki-Janson(1),Karol Kunicki(1),Julianna Godziszewska(1),Maciej Piasecki(1),Konrad Wojtasik(1) ((1) Department of Artificial Intelligence, Faculty of Information and Communication Technology, Wrocław University of Science and Technology, Wrocław 50-370, Poland)
机构: Wrocław University of Science and Technology (弗罗茨瓦夫科学与技术大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has demonstrated significant capabilities in enhancing the performance of Large Language Models (LLMs). One of the key tasks in RAG systems is the chunking process. Traditionally, fixed-size chunking and semantic chunking have been the standard approaches. However, interest in chunking strategies has been increasing, leading to a growing number of proposed methods that often claim improved performance over these conventional techniques. Many of these approaches are tailored to specific use cases and data types, with limited evidence of their effectiveness across diverse scenarios. As a result, it remains challenging to directly compare different techniques and assess their relative strengths. To the best of our knowledge, this study is the first to systematically evaluate the effectiveness of a wide range of chunking methods and emphasize the underlying challenges of chunking strategies in RAG systems. While chunking is commonly treated as a simple preprocessing step, we show that it introduces a range of impactful and often overlooked issues.
[NLP-174] IDEAFix: Evaluation Framework for Creative Defixation Prompting in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在创造性问题求解与创意生成任务中,其创造力评估缺乏一致性与可解释性的问题。现有评估方法或局限于过于狭窄、脱离实际情境的任务,无法反映目标导向的生成过程;或采用过于宽泛的设置,使任务设计、提示策略与评价体系之间的多重因素相互混淆,难以准确分离各变量对生成结果的影响。尤其关键的是,结构化提示策略(structured prompting strategies)在塑造创意生成过程中的作用尚未得到充分探索。为此,论文提出IDEAFix——一个用于分析开放式创意生成任务中发散思维(divergent thinking)的评估框架。该框架通过控制设计场景的变体、任务属性及去固化提示策略,系统引导模型生成多个原创性解决方案,从而实现对结构化引导如何影响生成质量的可复现分析。研究发现,任务设计与属性选择显著影响模型表现,且简单提示策略可有效提升方案原创性;但同时观察到不同模型间输出存在持续同质化现象,揭示了其在生成多样化解决方案方面的内在局限。总体而言,IDEAFix提供了一个受控、可扩展的实验范式,有助于深入理解大语言模型创造力背后的机制。
链接: https://arxiv.org/abs/2606.00875
作者: F. Carichon,S. Sharma,M. Girard,R. Rampa,G. Farnadi
机构: McGill University (麦吉尔大学); Mila (Mila); Concordia University (康考迪亚大学); ÉTS (ÉTS)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) are increasingly used for tasks involving creative problem solving and idea generation. However, there is a lack of consensus concerning their creative capabilities: some studies report superior performances compared to humans, while others highlight structural limitations such as fixation and the homogenization of outputs. Existing evaluation approaches either rely on narrow, decontextualized tasks that do not capture goal-oriented generation or on broader settings that confound multiple aspects of the creative process, making it difficult to isolate the effects of task formulation, prompting, and evaluation design. Significantly, the role of structured prompting strategies in shaping idea generation remains underexplored. Therefore, we introduce IDEAFix, an evaluation framework for analyzing divergent thinking in open-ended idea generation tasks. We prompt models to generate multiple original solutions to controlled variations of short design scenarios, task attributes, and defixation prompting strategies. This design enables systematic analysis of how structured guidance influences LLMs’ idea generation. Our results show that both task formulation and attribute selection significantly affect models’ performance, and that simple prompting strategies can boost the originality of solutions. However, we also observe persistent output homogenization across models, confirming inherent limits in their ability to generate diverse solutions. Overall, IDEAFix provides a controlled, extensible framework for studying the mechanisms underlying LLMs’ creativity.
[NLP-175] GenPT: Beyond Self-Report for Reliable LLM Psychometrics via Generative Projective Testing
【速读】: 该论文旨在解决当前基于自陈量表(self-report questionnaires)评估人格化智能体(persona-conditioned agents, PC-Agents)心理状态时所面临的两大方法学挑战:一是训练语料库带来的内容污染(contamination from training corpora),二是由社会期许性或情境框架引发的方向性偏差(directional bias driven by social-desirability or contextual framing)。为克服上述局限,论文提出一种新型心理测量工具——生成式投射测验(Generative Projective Testing, GenPT),其核心创新在于将传统的主题统觉测验(TAT)、罗夏墨迹测验(Rorschach)和句子完成测验(SCT)与生成式AI(Generative AI)相结合,通过生成全新刺激材料,并构建三阶段评估流程,以提取标准化的心理指标与目标状态。实验结果表明,传统量表在社会期许性情境下表现出系统性的方向偏移,尤其在自杀意念维度上最为显著;而GenPT所捕获的行为模式则维持在对称基线附近,展现出更强的抗污染能力与情境敏感性。在纵向心理咨询场景中,以Qwen3为基座模型时,GenPT在抑郁评估上的变化幅度比传统量表高一个数量级。因此,GenPT的关键优势在于其对内容污染的抵抗能力、对偏差不对称性的缓解以及对上下文敏感性的适应性,可作为自陈法在高可靠性、低偏倚与动态情境评估中的有效补充。
链接: https://arxiv.org/abs/2606.00860
作者: Ming Wang,Shuang Wu,Bixuan Wang,Lu Lin,Yuxin Chen,Xiaocui Yang,Daling Wang,Shi Feng,Yifei Zhang,Yufan Sun
机构: Northeastern University(东北大学); Singapore Management University(新加坡管理大学); Northeast Normal University(东北师范大学); Southwest University(西南大学); Central University of Finance and Economics(中央财经大学); College of Arts, Northeastern University(东北大学艺术学院)
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Self-report questionnaires remain the prevailing tool for probing the psychological states of persona-conditioned agents (PC-Agents). However, classical instruments inherit two well-known threats: contamination from training corpora and directional bias driven by social-desirability or contextual framing. To overcome these methodological bottlenecks, we ask whether projective paradigms can be adapted into a robust psychometric tool. We introduce \textbfGenPT (Generative Projective Testing), which reformulates TAT, Rorschach, and SCT with newly generated stimuli and organizes assessment as a three-stage pipeline to derive standardized psychological indicators and target states. Evaluating PC-Agents induced via CharacterRAG and AnnaAgent profiles, we benchmark GenPT’s reliability and validity against classical questionnaires. The results indicate that questionnaires exhibit systematic directional shifts under social-desirability framing, most strongly on suicide ideation. In contrast, GenPT’s collected behavioral patterns stay near the symmetric baseline. Furthermore, under a longitudinal counselling context, GenPT-based depression assessment shifts by roughly an order of magnitude more than the questionnaire counterpart when Qwen3 serves as the backbone. Overall, GenPT complements self-report methods in scenarios where contamination resistance, bias asymmetry, and context sensitivity matter. Code and stimuli can be found at this https URL.
[NLP-176] Momento: Evaluating Persistent Memory and Reasoning with Multi-Session Agent ic Conversations
【速读】: 该论文旨在解决现有智能体(agentic AI)评估体系在多轮交互中忽视用户状态持续性与任务目标演化的问题,即当前基准测试仅限于单次会话评估,未能充分考虑用户历史行为、偏好及先前决策对后续任务执行的影响。其核心解决方案是提出Momento——一个面向多会话服务环境的持久化智能体任务完成基准,要求智能体在跨会话情境下通过工具调用执行具有因果意义的动作,同时处理时间依赖关系和动态演化的用户目标。关键创新在于引入“用户状态再验证”机制,强调智能体需识别并更新过时的历史信息,而非将其视为可靠的上下文代理,从而揭示了当前智能体在长时程人机交互中因误估用户状态而导致失败的根本缺陷,凸显出实际应用中智能体能力与真实需求之间的显著差距。
链接: https://arxiv.org/abs/2606.00832
作者: Adril Putra Merin,David Anugraha,Ayu Purwarianti,Genta Indra Winata
机构: Institut Teknologi Bandung(Bandung理工学院); Stanford University(斯坦福大学); Capital One(资本一号)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Recent advances in agentic AI have enabled agents to complete complex tasks through tool use, reasoning, and multi-step planning. Yet existing benchmarks evaluate agents within a single session, ignoring past actions, stated preferences, and prior decisions that agents must integrate to fulfill personalized user goals. We introduce Momento, a benchmark for persistent agentic task completion in multi-session service environments, requiring agents to take consequential, tool-mediated actions while resolving temporal dependencies and evolving user goals across sessions. Experimental results reveal that current agents fail primarily through misestimation of user state, treating prior session history as a reliable proxy for current context rather than stale information requiring re-validation, highlighting a substantial gap between current agent capabilities and realistic long-horizon human-agent interaction.
[NLP-177] Not All Flips Are Conformity: Decomposing Stance Convergence in Multi-Agent LLM Debate
【速读】: 该论文旨在解决多智能体辩论(Multi-agent Debate, MAD)中一个关键问题:当多个大语言模型(Large Language Models, LLMs)在推理过程中达成一致答案时,这种收敛究竟是源于深层次的理性思辨,还是仅仅出于社会性顺从(social compliance)。其核心挑战在于,传统评估指标“答案翻转率”(answer flip rate)将多种机制混杂在一起,包括自发不稳定性、立场诱导的从众行为以及基于推理的说服力。为此,作者提出一种三源分解框架(three-source decomposition framework),通过受控的反事实实验条件分离出这三种机制。研究发现,在主实验设置MMLU-Pro中,仅通过自我反思就有37%的问答对发生改变;而在不同模型家族和基准测试(如GPQA-Diamond)中,模型依赖性的不稳定性显著存在。严格意义上的从众行为在主设置中占比29%,且在多数模型复现中均表现为有害影响(正确答案转为错误的比例达57%-77%)。进一步的信息梯度实验表明,即使推理内容空洞,只要呈现为“推理形式”,仍可导致20%-39%的错误采纳率,说明形式化的推理表达本身具有强大的说服力。此外,研究发现可通过初始轮次特征预测有害从众风险(AUC=0.79),并设计针对性干预措施使有害从众降低13.6个百分点(p < 0.001)。然而,若缺乏正确性标签或自我反思控制,单纯减少同伴采纳并不能提升准确率,因为无法区分有益与有害的影响。因此,解决方案的关键在于:通过结构化反事实实验分离不同机制,并结合可解释的特征与靶向干预,实现对有害从众的有效识别与抑制,从而真正提升多智能体系统中的理性决策质量。
链接: https://arxiv.org/abs/2606.00820
作者: Xiqi Hao,Zengqing Wu,Yu-Xuan Qiu,Chuan Xiao,Ruiqi Xu,Shuyuan Zheng,Jianbin Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-agent debate (MAD) is a promising strategy for improving LLM reasoning, but when agents converge on a shared answer, it is unclear whether that convergence reflects genuine deliberation or social compliance. We show that the conventional answer flip rate conflates three distinct mechanisms: spontaneous instability, stance-induced conformity, and reasoning-induced persuasion. Our three-source decomposition framework isolates each through controlled counterfactual conditions. In the primary MMLU-Pro setting, 37% of agent-question observations change under self-reflection alone, while robustness tests show substantial model-dependent instability across GPQA-Diamond and three model families; strict conformity is 29% in the primary setting and remains predominantly harmful across model replications (57-77% correct-to-wrong). A controlled information-gradient experiment reveals that even vacuous reasoning is associated with 20-39% error adoption among resistant agents, with reasoning-like presentation carrying substantial persuasive weight. Harmful conformity can be predicted from Round 0 features (AUC = 0.79), and risk-targeted intervention reduces it by 13.6 percentage points (p 0.001). However, without correctness labels or self-reflection controls, reducing peer adoption does not improve accuracy, because harmful and beneficial influence cannot be distinguished.
[NLP-178] Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在迭代演进过程中安全对齐(safety alignment)性能非单调变化的问题,即模型安全性并非随版本升级而持续提升。其核心挑战在于传统静态基准测试无法捕捉模型在实际对抗性攻击下的安全脆弱性动态演变。研究采用质量-多样性进化算法(MAP-Elites)作为自动化红队探测工具,对谷歌Gemma系列四个版本(7B–31B)进行纵向、自适应的攻击演化分析,发现Gemma 3(12B)的安全漏洞显著加剧,攻击成功率(ASR)达68.7% ± 5.7%,显著高于Gemma 2(45.5% ± 7.2%;p = 0.030),且优于后续版本Gemma 4(33.9% ± 1.8%)。关键发现在于:攻击样本在不同代际间的迁移能力差异揭示了安全改进的非均匀性——针对早期模型生成的攻击在Gemma 3上仍保持44–46%的有效性,而在Gemma 4上仅14–18%,表明后者安全增强具有更强泛化能力。此外,虚假信息类攻击的ASR从Gemma 2的29%跃升至Gemma 3的99%并维持在Gemma 4的77%,说明安全退步未被完全修复。这些复杂模式无法通过静态评估识别,凸显了基于自适应、长期演化探针方法在揭示真实安全动态中的必要性。
链接: https://arxiv.org/abs/2606.00813
作者: Subhadip Mitra
机构: Rota Labs(罗塔实验室); Google(谷歌)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 8 pages, 3 figures
Abstract:Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google’s Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but only 14-18% to Gemma 4, indicating that Gemma 4’s safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at this https URL.
[NLP-179] Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety ICLR2026
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)对抗测试中存在的覆盖率不足问题:传统人工红队测试难以规模化,基于LLM作为攻击者的方法易出现模式崩溃(mode collapse),而基于梯度的方法生成的攻击文本往往缺乏可解释性且语义混乱。其解决方案的关键在于提出一种基于质量-多样性(quality-diversity)的进化框架,该框架在语义层面而非词元序列层面进行演化,通过优化可解释的攻击策略而非随机生成的文本序列,从而实现对模型漏洞的系统性探索。该方法采用MAP-Elites算法,构建一个跨行为维度(如攻击策略类型、编码方式、提示长度)的多样化攻击档案库,有效捕捉不同模型在特定攻击模式下的脆弱性特征。实验结果揭示了多款主流模型(GPT-4o-mini、Claude 3.5 Sonnet、Gemini 2.0 Flash及开源代码模型Devstral-small-2)的独特漏洞分布:例如,GPT-4o-mini对假设性与多轮对话结合ROT13编码的攻击高度敏感(适应度0.8),Gemini则易受直接攻击与多轮结合Leetspeak编码的影响(适应度0.8),而Claude表现出普遍模糊响应(最高适应度仅0.4)。该方法生成的语义可解释攻击揭示了模型固有的系统性弱点,为提升大语言模型安全性提供了可操作的洞察,并建立了可复现的基准,以评估未来前沿模型的安全性。
链接: https://arxiv.org/abs/2606.00801
作者: Subhadip Mitra
机构: Rota Labs; OpenAI; Anthropic; Google
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注: 9 pages, 6 figures. Accepted at the ICLR 2026 Workshop on Agents in the Wild (AIWILD)
Abstract:Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at this https URL.
[NLP-180] Confidence-Adaptive SwiGLU for Mixture-of-Experts
【速读】: 该论文旨在解决现代混合专家(Mixture-of-Experts, MoE)模型中门控激活函数(如SwiGLU)的门控锐度(gate sharpness)固定不变的问题,即在训练过程中无法根据输入令牌(token)的路由置信度动态调整门控函数的平滑性与选择性。传统SwiGLU采用固定的门控锐度参数,导致其在处理不同置信度的路由决策时缺乏灵活性。为解决此问题,作者提出了一种名为自信感知的SwiGLU(κ-SwiGLU)的新方法,其核心创新在于将SiLU门控函数的锐度系数建模为路由器对数几率(router logit)的可学习函数,使每个专家门控单元能够根据当前输入的路由置信度,在平滑广泛激活与尖锐选择性激活之间进行自适应插值。该方法在包含8至28层的MoE Transformer模型上于FineWeb-Edu数据集上的实验表明,κ-SwiGLU在几乎不增加参数量且仅有轻微计算开销的前提下,显著提升了平均CORE性能,验证了基于路由置信度动态调节门控锐度的有效性与潜力。
链接: https://arxiv.org/abs/2606.00761
作者: Shaohua Li,Xiuchao Sui,Xiaobing Sun,Yuhang Wu,Liangli Zhen,Yong Liu,Rick Siow Mong Goh
机构: Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore; Shanghai University of Engineering Science, China
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 13 pages, 10 figures
Abstract:SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness – the smoothness and selectivity of the gating function – is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ( \kappa -SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, \kappa -SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate \kappa -SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, \kappa -SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at this https URL.
[NLP-181] Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning
【速读】: 该论文旨在解决大语言模型在基于可验证奖励的强化学习(Reinforcement Learning from Verifiable Rewards, RLVR)过程中常见的熵崩溃(entropy collapse)问题,即策略逐渐趋于集中,导致回溯采样多样性下降和有效学习信号减弱。现有方法多通过外部干预手段如熵正则化或调整采样温度来缓解此问题,但这些方法无法内化于模型参数中。本文提出一种轻量级的在线自蒸馏策略——温度缩放的在线自蒸馏(Temperature-Scaled On-Policy Self-Distillation, TS-OPSD),其核心在于将温度调节带来的探索性效应直接内化至模型参数。具体而言,从熵崩溃后的强化学习检查点出发,TS-OPSD利用高温度缩放生成自身输出的平滑分布作为“自教师”(self-teacher),再将该分布通过知识蒸馏过程回传至原模型作为“学生”。该方法无需外部教师模型、特权数据或额外推理开销,实验表明其在Qwen3-4B-Base与Qwen3-8B-Base上的表现优于标准持续强化学习及采样阶段温度重加热策略。进一步分析显示,TS-OPSD主要降低输出尖锐度,同时保留中间表示、顶级候选集及推理能力。结果表明,熵恢复可作为一种简单有效的后崩溃干预手段,用于延长面向推理任务的强化学习生命周期。
链接: https://arxiv.org/abs/2606.00755
作者: Xuewei Yang,Jiachen Yu,Jie Wu,Shaoning Sun,Junjie Wang,Yujiu Yang
机构: Tsinghua University (清华大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model’s own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.
[NLP-182] I-WebGenBench : Evaluating Interactivity in LLM -Generated Scientific Web Applications
【速读】: 该论文旨在解决现有文档智能代理在处理技术性论文时存在的局限性问题,即当前方法多将研究论文转化为静态输出(如摘要、网页或幻灯片),难以有效表达涉及动态机制与状态变迁的复杂科学内容。其核心解决方案是提出一种“论文到交互系统代理”(Paper-to-Interactive-System Agent)框架,能够端到端地将PDF格式的研究论文自动转换为可执行的交互式网页系统。该框架的关键在于引入PaperVoyager——一个结构化生成范式,显式建模论文中的工作机制与交互逻辑,在系统建模与前端合成阶段保持对动态行为的精准刻画。通过构建包含19篇论文及其专家构建的交互系统作为真实标签的基准数据集,实验验证了PaperVoyager在生成交互系统质量上的显著提升,为科学论文的理解与交互式呈现提供了全新范式。
链接: https://arxiv.org/abs/2606.00750
作者: Dasen Dai,Biao Wu,Meng Fang,Shuoqi Li,Wenhao Wang
机构: Vast Intelligence Lab(大模型智能实验室); UTS(悉尼科技大学); University of Liverpool(利物浦大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 4 figures
Abstract:Recent advances in visual language models have enabled autonomous agents for complex reasoning, tool use, and document understanding. However, existing document agents mainly transform papers into static artifacts such as summaries, webpages, or slides, which are insufficient for technical papers involving dynamic mechanisms and state transitions. In this work, we propose a Paper-to-Interactive-System Agent that converts research papers into executable interactive web systems. Given a PDF paper, the agent performs end-to-end processing without human intervention, including paper understanding, system modeling, and interactive webpage synthesis, enabling users to manipulate inputs and observe dynamic behaviors. To evaluate this task, we introduce a benchmark of 19 research papers paired with expert-built interactive systems as ground truth. We further propose PaperVoyager, a structured generation framework that explicitly models mechanisms and interaction logic during synthesis. Experiments show that PaperVoyager significantly improves the quality of generated interactive systems, offering a new paradigm for interactive scientific paper understanding.
[NLP-183] From Empathy to Personalized Empathy: Adapting Empathetic Strategies to Individual Users
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期用户交互中缺乏个性化共情能力的问题,尤其关注用户人格特质对共情策略的影响被现有研究忽视的缺陷。其核心挑战在于如何根据用户的个性化特征(如历史行为、人格画像等)动态调整共情响应策略,以实现更精准、自然的互动体验。解决方案的关键在于提出一种名为PereGRM的奖励建模框架,该框架融合共情评估结构与动态评价标准生成机制,实现了细粒度的奖励建模,能够有效捕捉用户个性化需求下的共情表现差异。通过构建包含丰富用户历史与人格信息的PersonaEmp数据集,并结合多轮评估实验验证,PereGRM展现出在不同设置下均显著提升个性化共情能力的性能优势,验证了其在增强长期交互中个性化共情能力方面的有效性。
链接: https://arxiv.org/abs/2606.00728
作者: Wuqiang Zheng,Chengbing Wang,Yilin Yang,Junyi Cheng,Jianfei Xiao,Hu Sun,Yi Xie,Yangyang Li,Wenjie Wang
机构: University of Science and Technology of China (中国科学技术大学); Huawei Technologies (华为技术); China Academy of Cyber (中国网络研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:As Large Language Models (LLMs) are increasingly deployed in long-term interactions with users, empathy has become an increasingly important capability. However, existing research overlooks the influence of users’ personality traits on empathetic strategies during long-term interactions. To address this gap, we introduce the task of personalized empathy, which focuses on adapting empathetic strategies according to users’ personalized characteristics derived from history. To study and enhance this capability, we construct PersonaEmp, a personalized empathy dataset built from long-term user-AI interactions, featuring rich user histories, persona information, and empathy-seeking queries. We further propose PereGRM, a reward modeling framework that combines the empathy evaluation structure with dynamic evaluation criteria generation for fine-grained reward modeling. Experimental results across different settings and multiple judge models show that PereGRM consistently achieves the strongest performance improvements, indicating its effectiveness for enhancing personalized empathetic capabilities.
[NLP-184] WaveFilter: Enhancing the Long-Context Capability of Diffusion LLM s via Wavelet-Guided KV Cache Filtering
【速读】: 该论文旨在解决扩散型大语言模型(Diffusion Large Language Models, DLMs)在处理长上下文任务时,由于其多步迭代推理机制导致的计算开销大、推理延迟高这一核心瓶颈问题。尤其在长序列场景下,现有键值(Key-Value, KV)缓存机制面临生成质量显著下降的困境,其根本挑战在于如何在超长上下文中精确且高效地筛选关键令牌。为此,论文提出了一种通用且无需训练的缓存框架——WaveFilter,该方案受人类阅读过程启发,创新性地引入小波变换(wavelet transform)对长序列进行分解,实现对关键令牌的精准定位,并基于此构建稀疏的KV缓存以计算最终的上下文表示。其解决方案的关键在于利用小波变换的多尺度分析能力,从冗长的上下文中提取具有语义重要性的信息,从而在不增加计算负担的前提下提升长序列建模效率与生成质量。
链接: https://arxiv.org/abs/2606.00724
作者: Jinnan Yang,Yan Wang,Zhen Bi,Kehao Wu,Xiaojie Li,Jungang Lou,Zechao Li,Jing Liu
机构: Nanjing University of Science and Technology; Alibaba Group; Huzhou Normal University; Institute of Automation, Chinese Academy of Sciences
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 8 pages,3 figures
Abstract:Diffusion Large Language Models (DLMs) have demonstrated significant advantages across various tasks. However, constrained by their multi-step iterative inference mechanism, their computational overhead and inference latency in long-context tasks have become core bottlenecks restricting their large-scale deployment. When processing long sequences, existing Key-Value (KV) caching mechanisms often face a dilemma where generation quality degrades drastically, where the core challenge lies in precisely and efficiently filtering critical tokens within ultra-long contexts. Inspired by the human reading process, we propose \textbfWaveFilter, a universal and training-free caching framework. This framework innovatively introduces the wavelet transform for decomposition of long sequences to achieve precise identification of key tokens, based on which a sparse KV Cache is constructed to compute the final contextual representation. Experimental results demonstrate that WaveFilter, as a plug-and-play generic framework, significantly enhances the performance of existing mainstream KV Cache methods in complex long-context tasks.
[NLP-185] EPIC: Efficient and Parallel Inference under CFG Constraints for Diffusion Language Models
【速读】: 该论文旨在解决扩散语言模型(diffusion language model)在受上下文无关文法(CFG)约束条件下进行解码时效率低下的问题。现有方法虽然能够实现对生成输出的语法正确性控制,但其通过逐个验证生成过程引入了显著延迟,导致解码速度比无约束解码慢达四倍,并严重削弱了扩散模型相较于自回归模型所具备的并行解码优势。本文提出一种高效的CFG约束解码框架EPIC,其核心在于通过词法分析记忆化、采用类似Earley的解析方式替代确定性自动机进行有效性验证,以及引入松弛的兼容子集选择策略以支持并行提交多个合法标记。该方案有效减少了重复的词法分析与验证开销,同时允许批量提交兼容的生成结果,从而大幅提升了解码效率。实验表明,EPIC在三个基准测试上使用四种模型均实现了最高达67.5%的推理时间降低,且额外开销减少高达90.5%,显著优于现有方法。
链接: https://arxiv.org/abs/2606.00722
作者: Hyundong Jin,Yo-Sub Han
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Controlling language model outputs is essential for ensuring structural validity, reliability, and downstream usability, and diffusion language models are no exception. Recent advances in diffusion language model decoding have extended output control beyond regular constraints to context-free grammar (CFG) constraints. Existing methods, however, can be up to four times slower than unconstrained decoding. More importantly, they substantially diminish one of the key advantages of diffusion language models over autoregressive models, namely parallel decoding. This slowdown arises because sequential validity checking introduces significant overhead during parallel generation. We propose an efficient CFG-constrained decoding framework, EPIC, that addresses this limitation. Our method improves decoding efficiency by combining lexing memoization, validation using Earley-style parsing instead of deterministic automata, and relaxed compatible subset selection for parallel commit. It reduces repeated lexing and validation overhead while allowing multiple compatible tokens to be committed together. Experiments on three benchmarks using four models show that our method reduces inference time by up to 67.5% and decreases the additional overhead by up to 90.5% compared with existing CFG-constrained decoding methods. Our implementation is available at this https URL .
[NLP-186] OCC-RAG : Optimal Cognitive Core for Faithful Question Answering
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中过度依赖参数化知识而忽视推理能力的问题,尤其针对需要高精度、可解释性推理的任务场景。现有模型虽通过规模扩张积累了海量知识,但在需多跳推理(multi-hop reasoning)且严格遵循给定上下文的问答任务中表现受限。为此,论文提出“最优认知核心”(Optimal Cognitive Core, OCC)系列小语言模型(Small Language Models, SLMs),其核心设计原则是聚焦特定任务的高效推理能力而非泛化知识存储。解决方案的关键在于构建一种名为OCC-RAG的优化架构,专为基于上下文的忠实问答(faithful QA)设计,强调在多文档上下文中进行多跳推理的同时完全依赖显式引用内容,拒绝使用内部记忆知识。为实现此目标,研究团队开发了一种可扩展的合成数据生成流水线,大规模构建包含超过三百万样本的多上下文、多跳问答数据集,重点确保推理过程的上下文忠实性与合理拒答(calibrated abstention)。所发布的OCC-RAG-0.6B和OCC-RAG-1.7B模型在热力图问答(HotpotQA)、MuSiQue、TAT-QA等多跳推理基准上,性能达到甚至超越自身规模2至6倍的通用大模型,同时具备结构化推理轨迹与基于原文引述的溯源能力,验证了任务专用型小型模型在复杂推理任务中的优越性。
链接: https://arxiv.org/abs/2606.00683
作者: Maksim Savkin,Mikhail Goncharov,Alexander Gambashidze,Alla Chepurova,Dmitrii Tarasov,Nikita Andriianov,Daria Pugacheva,Vasily Konovalov,Andrey Galichin,Ivan Oseledets
机构: OCC Team
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world’s knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2 – 6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.
[NLP-187] AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning
【速读】: 该论文旨在解决自然语言数学推理中生成式 AI(Generative AI)因缺乏可解释性与可信验证机制而导致的“不可靠输出”问题,尤其关注在实际应用中如何保证推理过程的确定性与答案的可验证性。其核心解决方案在于构建一种以信任为先的神经符号执行架构——AXIOM,其中语言模型仅作为规范化器(canonicalizer),将非正式的数学问题文本严格映射至由确定性计算机代数系统(CAS)处理的窄化模式(schema),从而实现答案的生成与形式化验证。该方案的关键在于建立1:1:1的路由对齐机制:问题结构正则表达式、特定模式提示词与闭式CAS处理函数之间一一对应,结合3,100余条预定义路径与零“丢失正确答案”(LOST_CORRECT)回归的稳定性保障,实现了94.36%的累计准确率与100%的解析可信度(无任何高置信度错误答案)。此外,通过“解析优先”上线策略、数学模板分桶(math-template bucketing)、LOST_CORRECT扫描作为回归检测器以及将“放弃回答”(abstain)作为首等输出,形成了一套可迁移的可信神经符号系统操作范式,确保新任务的引入不会破坏已有能力,使每一次生产中记录的放弃行为均成为下一轮迭代中潜在正确的候选,从而建立持续演进的信任动态。
链接: https://arxiv.org/abs/2606.00671
作者: Alessio Bruno
机构: Independent researcher(独立研究员)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Preprint. 12 pages, 2 figures. Live interactive demo: this https URL . Paper artifact and dataset on Zenodo (concept-DOI): https://doi.org/10.5281/zenodo.20440225
Abstract:We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property – math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output – constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.
[NLP-188] FineVerify: Scaling Test-Time Compute with Fine-Grained Self-Verification for Agent ic Search
【速读】: 该论文旨在解决生成式 AI(Generative AI)在代理搜索(agentic search)任务中因答案稀疏性及模型校准不足导致的推理失败问题。当前基于评分的选择机制依赖于模型输出的置信度,但在复杂信息检索场景下,正确答案往往难以被有效识别,从而限制了测试时计算资源扩展(scaling test-time compute)的效果。其解决方案的关键在于提出一种细粒度自验证框架 FineVerify,通过将复杂问题分解为可验证的子问题,对每个采样候选答案进行逐项验证,并基于统一明确的标准聚合得分,实现更可靠、可解释的决策。该方法将全局选择转化为局部、可操作的判断过程,显著提升了答案选取的准确性与鲁棒性。实验表明,在四个代理搜索基准和两种模型上,FineVerify 均显著优于标准扩展基线;仅用4条轨迹采样即使 GPT-5-mini 提升8.2个准确率点,Gemini-3-flash 提升5.6%;使用12条样本时,GPT-5-mini 在 BrowseComp-Plus 上超越前沿 GPT-5。此外,其生成的可解释验证轨迹有助于审计错误,具备广泛应用于系统可解释性分析的潜力。
链接: https://arxiv.org/abs/2606.00660
作者: James Xu Zhao,Hui Chen,Bryan Hooi,See-Kiong Ng
机构: 未知
类目: Computation and Language (cs.CL)
备注: 8+18 pages, 6 tables, 11 figures
Abstract:Agentic search requires language model agents to explore many sources and answer complex information-seeking questions. Scaling test-time compute is a promising way to improve these agents, but current approaches can fail, because correct answers are often sparse and score-based selection depends on model calibration. We propose FineVerify, a fine-grained self-verification framework that decomposes each question into checkable sub-questions, verifies sampled candidates against each sub-question, and selects the candidate with the highest aggregated score. This per-check structure turns selection into simpler local judgments and produces scores under the same explicit criteria. Across four agentic search benchmarks and two models, FineVerify consistently outperforms standard scaling baselines. With only four sampled trajectories, it improves GPT-5-mini by 8.2 accuracy points and Gemini-3-flash by 5.6% on average. With 12 samples, FineVerify enables GPT-5-mini to surpass frontier GPT-5 on BrowseComp-Plus. Beyond accuracy, FineVerify produces interpretable verification traces that help audit benchmark errors, suggesting broader applications for inspecting agentic search systems. Code and data are available at this https URL
[NLP-189] MESA: Improving MoE Safety Alignment via Decentralized Expertise ICML2026
【速读】: 该论文旨在解决基于混合专家(Mixture-of-Experts, MoE)架构的大语言模型(Large Language Models, LLMs)中存在的“安全稀疏性”(Safety Sparsity)问题,即安全能力过度集中于少数专家,导致系统易受对抗性绕过攻击;同时针对传统对齐方法对所有参数进行均匀调整、忽视模块功能差异而引发性能退化的缺陷。其解决方案的关键在于提出一种名为MESA(MoE Safety Alignment)的针对性对齐框架,通过最优传输(Optimal Transport, OT)理论实现安全责任的策略性去中心化分配:一方面,利用传输成本矩阵进行专家容量重分配,将安全任务指派给最具成本效益的专家;另一方面,通过动态路由优化机制约束路由器精准激活这些分散的安全模块,从而在最大化安全覆盖范围的同时最小化对模型通用能力的干扰。实验表明,MESA在多种有害内容基准测试中均表现出稳健的防御性能,且有效保持了模型的有用性。
链接: https://arxiv.org/abs/2606.00651
作者: Yitong Sun,Yao Huang,Teng Li,Ranjie Duan,Yichi Zhang,Xingjun Ma,Hui Xue,Xingxing Wei
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 8 figures, accepted by ICML 2026
Abstract:Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at this https URL.
[NLP-190] LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen 3-8B for Psychological Defense Mechanism Classification ACL2026
【速读】: 该论文旨在解决对话文本中心理防御机制(psychological defense mechanisms)识别这一具有挑战性的临床自然语言处理(clinical NLP)问题,特别是在存在严重类别不平衡的九类话语分类任务中提升模型性能。其解决方案的关键在于针对罕见类别表现不佳的问题,采用基于QLoRA的微调策略对Qwen3-8B模型进行优化,并结合三项核心技术:分组分层交叉验证(防止数据泄露)、少数类别轮换式词汇增强(minority-class round-robin lexical augmentation),以及包含逻辑偏置调优与集成融合的后处理流水线。这些方法协同作用,显著缩小了验证集与官方排行榜之间的性能差距,大幅提升了少数类别的召回率,尤其使原本几乎无法识别的“不确定”类(Unclear, Level 8)F1得分从接近零提升至0.797,成为实现高宏平均F1(0.3917)的核心驱动力。
链接: https://arxiv.org/abs/2606.00647
作者: Shefayat E Shams Adib,Ahmed Alfey Sani,Md Hasibur Rahman Alif,Ajwad Abrar
机构: Islamic University of Technology (伊斯兰科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at PsyDefDetect, a shared task at the 25th BioNLP Workshop (BioNLP 2026), co-located with ACL 2026 in San Diego, CA, USA
Abstract:Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (nine-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by 7.7 absolute points (24.4 percent relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit bias tuning and ensemble blending. Together, these components close much of the validation-to-leaderboard gap and substantially improve minority-class recall, driving the critical “Unclear” class (Level 8) from near-zero performance to an F1 score of 0.797.
[NLP-191] French parsing enhanced with a word clustering method based on a syntactic lexicon
【速读】: 该论文旨在解决法语句法解析中因词汇歧义和语法结构复杂性导致的解析准确率不足的问题。其核心解决方案在于将从法语句法词典(Lexicon-Grammar,Gross, 1994)中提取的词汇信息整合进基于概率上下文无关文法(Probabilistic Context-Free Grammar, PCFG)的解析器中,并通过在法国树库(French Treebank,Abeillé et al., 2003)的动词上应用聚类方法,实现对动词语义-句法特征的有效归纳,从而显著提升解析性能。该方法的关键在于利用词典知识指导动词的聚类,使解析器能够更准确地捕捉动词在不同句法环境下的分布模式,进而增强模型对复杂句式结构的建模能力。
链接: https://arxiv.org/abs/2606.00634
作者: Anthony Sigogne,Matthieu Constant,Eric Laporte
机构: Université Paris-Est, LIGM (巴黎-东部大学,语言与信息研究实验室)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeillé et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006).
[NLP-192] Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
【速读】: 该论文旨在解决自蒸馏(self-distillation)过程中因参考答案(reference answers)引入的强烈风格偏差问题,这种偏差导致生成式模型倾向于模仿表面形式而非学习有效的推理模式。其核心挑战在于:自蒸馏生成的重写数据中包含大量高困惑度(high-perplexity, PPL)token,这些token既可能源于有益的逻辑修正(知识增强),也可能源于对参考答案风格的不当模仿(风格漂移)。若不加区分地处理所有高PPL token,将破坏基础模型的原始分布,尤其在复杂推理任务上导致性能下降。为此,论文提出分布对齐的自蒸馏(Distribution-Aligned Self-Distillation, DASD),其关键在于引入一种答案感知的参考模型以生成候选token,并基于基础模型自身的置信度动态过滤这些token,从而保留蕴含有用逻辑知识的token,同时抑制与模型分布不一致的风格噪声。实验结果表明,DASD在数学、代码和常识推理等多个基准测试中均显著优于现有基线,有效减少高PPL token数量,并提升不同难度任务下的鲁棒性。
链接: https://arxiv.org/abs/2606.00628
作者: Ruiqi Zhang,Lingxiang Wang,Hainan Zhang Zhiming Zheng
机构: Beihang University (北京航空航天大学); Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing (未来区块链与隐私计算高精尖创新中心)
类目: Computation and Language (cs.CL)
备注: 12 pages, 13 figures
Abstract:Self-distillation improves learning efficiency by rewriting reference answers as training data that better matches the model’s own distribution. However, reference answers also introduce strong stylistic biases, causing the generative model to imitate surface forms rather than learn useful reasoning patterns. We observe that the rewriting data contains a large number of high-perplexity (PPL) tokens, coming from two distinct sources: beneficial knowledge-enhancing logical corrections, and harmful stylistic drift induced by reference imitation. Treating all such tokens equally can disrupt the base model’s original distribution and degrade performance, especially on difficult reasoning tasks. To address this, we propose Distribution-Aligned Self-Distillation (DASD), which uses an answer-aware reference model to generate candidate tokens and dynamically filters them according to the base model’s confidence. DASD preserves tokens that encode useful logical knowledge while suppressing distributionally misaligned style noise. Experiments on math, code, and commonsense reasoning benchmarks show that DASD consistently outperforms competitive baselines, reduces high-PPL tokens, and improves robustness across tasks of varying difficulty.
[NLP-193] MemPro: Agent ic Memory Systems as Evolvable Programs
【速读】: 该论文旨在解决长时程自主智能体(long-horizon autonomous agents)在面对复杂任务时,因受限于有限上下文窗口而难以有效保留历史信息、追踪动态状态及复用相关知识的问题。现有代理记忆系统普遍采用记忆构建-检索(Memory Construction-Retrieval, MCR)的固定流水线架构,其主要缺陷在于仅调整记忆库(memory bank)或提示文本(prompt text),而保持整个流水线结构不变,导致系统难以应对多样化的任务特异性失败模式,并在记忆库随规模与结构演化过程中出现功能错配。为此,本文提出MemPro——一个系统级可演进框架,将完整的MCR流水线视为可演化的程序而非静态组件。MemPro通过维护一个可运行的记忆系统实现版本树,由演化代理(Evolving Agent)迭代选择有潜力的版本,诊断重复性故障,并基于故障模式引导的编辑-调试-优化流程生成改进后的子版本。在LongMemEval、LoCoMo、HotpotQA和NarrativeQA等多个基准上的实验表明,MemPro在数次迭代内即显著优于静态及提示级演化基线,且性能随演化持续提升,同时实现了良好的性能-成本权衡。
链接: https://arxiv.org/abs/2606.00619
作者: Qingshan Liu,Guoqing Wang,Wen Wu,Jingqi Huang,Xinqi Tao,Dejia Song,Jie Zhou,Liang He
机构: East China Normal University; Xiaohongshu Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 20 pages, 14 figures
Abstract:Long-horizon autonomous agents require memory systems to retain historical information, track evolving states, and reuse relevant knowledge beyond finite context windows. Existing agentic memory systems typically follow a memory construction-retrieval (MCR) pipeline, but often adapt mainly the memory bank while keeping the surrounding pipeline fixed after deployment. This fixed-pipeline design struggles to handle heterogeneous task-specific failure modes and can become misaligned with memory banks that evolve in scale and structure over time. To address these limitations, we propose MemPro, a system-level evolution framework that treats the entire MCR pipeline as an evolvable program rather than adapting only the memory bank or prompt text. MemPro maintains a version tree of runnable memory-system implementations, where an Evolving Agent iteratively selects promising versions, diagnoses recurring failures, and creates improved child versions through failure-mode-guided edit-debug refinement. Experiments on LongMemEval, LoCoMo, HotpotQA, and NarrativeQA show that MemPro consistently outperforms strong static and prompt-level evolving baselines within a few iterations, continues to improve with evolution, and achieves a favorable performance-cost trade-off. Code is available at this https URL.
[NLP-194] Linguistics-Aware Non-Distortionary LLM Watermarking
【速读】: 该论文旨在解决多语言环境下生成式AI(Generative AI)文本水印的鲁棒性与不可感知性难题,尤其针对不同语言在形态学、分词方式及书写系统上的差异所导致的水印证据自然嵌入困难的问题。其核心挑战在于如何在不降低生成质量或依赖模型提供方验证的前提下,实现跨语言、跨领域的高效水印检测。解决方案的关键在于提出LUNA——一种基于语言适应性的水印方法,其创新性体现在两个方面:一是采用无需模型依赖的检测机制,通过外部语料库中词性标注(part-of-speech tagging)上下文估计归一化下一标签熵(normalized next-tag entropy),以动态调节非破坏性二元锦标赛采样器的深度;二是检测端可仅凭文本、分词器、词性标注器与密钥重构相同的采样调度策略,实现无损且可验证的水印提取。实验表明,LUNA在六种类型多样化的语言和两个领域中均显著优于八种主流基线方法,在12个测试设置下取得0.9959的AUROC值,并将平均绝对中位困惑度偏移控制在0.045,其95%置信区间[0.022, 0.073]优于所有基线,同时在Self-BLEU、Distinct-1、意外度(surprisal)和熵等生成质量指标上也表现出最小的偏移,是唯一在多数场景中同时达到AUROC > 0.99且困惑度偏移 < 0.1的方法,在9/12设置中达成此目标,而基线方法最高仅在2个设置中实现。
链接: https://arxiv.org/abs/2606.00613
作者: Shinwoo Park,Hyejin Park,Hyeseon An,Yo-Sub Han
机构: Yonsei University (延世大学); Rensselaer Polytechnic Institute (伦斯勒理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Watermarking should identify language-model output without degrading quality or limiting verification to the model provider. Multilingual deployment makes this harder because morphology, segmentation, and script change where watermark evidence can enter naturally. We introduce LUNA, a linguistically adaptive watermark that combines model-free detection with single-token non-distortion under the standard random-key model. LUNA estimates normalized next-tag entropy from part-of-speech contexts in an external corpus and uses it to set the depth of a non-distortionary binary tournament sampler; the detector reconstructs the same schedule from text, a tokenizer, a tagger, and a secret key. We evaluate six typologically diverse languages and two domains against eight primary baselines. LUNA attains an AUROC of 0.9959 and the lowest mean absolute median perplexity shift of 0.045 across the twelve settings; its 95% bootstrap interval [0.022, 0.073] lies below all baseline intervals. LUNA also records the lowest mean Self-BLEU, Distinct-1, surprisal, and entropy shifts. It is the only method that simultaneously achieves AUROC 0.99 and an absolute median perplexity shift below 0.1 in a majority of settings, reaching this regime in 9 of the 12 settings while no baseline reaches it in more than 2. Our code is available at: this https URL
[NLP-195] oward Responsible and Epistemically Grounded Multilingual LLM s for Computational Social Science and Humanities
【速读】: 该论文旨在解决当前多语言推理大语言模型(Multilingual Reasoning LLMs)在人文社会科学(SSH)研究中应用时所面临的评估范式缺陷问题,具体表现为现有评价体系过度依赖任务导向的自然语言处理(NLP)基准,忽视了诠释有效性、文化情境性以及认识论中介性等关键维度。其解决方案的核心在于将多语言推理大语言模型重新概念化为解释学工具(hermeneutic instruments),强调其在跨语言与跨文化语境中主动建构意义的动态作用。基于解释学、技术哲学、科学与技术研究(STS)、多语言自然语言处理及计算社会科学方法论,论文构建了一个理论扎实的评估框架,提出可操作的指标体系,涵盖文化契合度、跨语言稳定性与推理忠实性,并针对诠释性研究任务设定了透明性要求。通过一个多语言政治话语分析的应用实例,验证了该框架的可行性与实践价值,为多语言推理大语言模型在计算社会科学基础设施中的负责任集成提供了概念与方法论基础。
链接: https://arxiv.org/abs/2606.00596
作者: Wajdi Zaghouani
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models have rapidly evolved in multilingual competence and reasoning capacity, enabling their integration into Social Sciences and Humanities research workflows. Yet existing evaluation paradigms remain anchored in task-based NLP benchmarks and fail to address interpretive validity, cultural situatedness, and epistemic mediation. This paper reconceptualizes multilingual reasoning LLMs as hermeneutic instruments that actively structure meaning production across linguistic and cultural contexts. Drawing on hermeneutics, philosophy of technology, science and technology studies, multilingual NLP research, and computational social science methodology, we develop a theoretically grounded framework for evaluating multilingual reasoning in Social Sciences and Humanities (SSH) research. We articulate a rigorous experimental protocol with operationalized metrics for cultural alignment, cross-lingual stability, and reasoning faithfulness, along with transparency requirements tailored to interpretive research tasks. We illustrate the framework through a concrete application scenario involving multilingual political discourse analysis. The paper contributes a conceptual and methodological foundation for responsible integration of multilingual reasoning LLMs into computational social science infrastructures.
[NLP-196] SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering
【速读】: 该论文旨在解决多答案问答(Multi-Answer QA)场景中长时程工具使用任务的挑战,即在面对需要发现全面有效答案的现实查询时,如何实现对复杂搜索轨迹中的细粒度信用分配以及激励持续探索以发现长尾实体。其核心解决方案是提出一种基于强化学习的框架SPADER,其关键在于引入分步同行优势(Step-wise Peer Advantage, SPA),这是一种无需评判器(critic-free)的逐步骤信用分配机制,通过在决策步骤层面对比并行轨迹的回报来估计优势值,从而实现更精准的策略优化;同时,SPADER设计了多样性感知的探索奖励机制,通过提升稀有发现的权重并抑制重复内容的奖励,有效促进对长尾实体的探索。实验结果表明,SPADER在QAMPARI、Mintaka、WebQSP和QUEST等多个基准数据集上显著提升了召回率与综合F1分数,优于基于提示(prompting-based)的代理、结果监督的强化学习方法及近期的逐步监督方法。
链接: https://arxiv.org/abs/2606.00593
作者: Qiming Shi,Zhaolu Kang,Yunfan Zhou,Di Weng,Yingcai Wu
机构: Zhejiang University(浙江大学); Peking University(北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are increasingly deployed as tool-augmented agents to acquire information beyond parametric knowledge. While recent work has improved long-horizon tool-use reasoning, most approaches focus on tasks with a single correct answer. In contrast, many real-world queries require discovering a comprehensive set of valid answers, a setting known as Multi-Answer QA. This setting raises two challenges: fine-grained credit assignment over long search trajectories and reward alignment for sustained exploration beyond easy high-frequency entities. We propose SPADER, a reinforcement learning framework for long-horizon tool use in Multi-Answer QA. SPADER includes Step-wise Peer Advantage (SPA), a critic-free step-level credit assignment mechanism that aligns parallel trajectories by decision step and estimates advantages from peer returns. It also includes a diversity-aware exploration reward that promotes long-tail entity discovery by upweighting rare findings and downweighting redundant ones. Experiments on QAMPARI, Mintaka, WebQSP, and QUEST show that SPADER generally improves recall and overall F1 over prompting-based agents, outcome-supervised RL methods, and recent step-level supervision approaches. Our code and model weights are available at this https URL.
[NLP-197] Sandboxed Coding Agents are Competitive Omni-modal Task Solvers
【速读】: 该论文旨在解决多模态大模型(Multimodal Large Language Models, LLMs)在处理音视频任务时普遍依赖原生全模态(omnimodal)架构的假设问题,即认为此类任务必须由具备音频、视频等多模态输入能力的模型才能有效完成。研究发现,仅具备文本与图像输入能力并配备沙盒化工具调用接口的编码代理(coding agents),在多个音视频基准测试中不仅能够达到甚至超越当前最优的原生全模态模型及预定义的多模态代理框架的表现。其解决方案的关键在于:通过编写代码并协调使用外部工具,从语音转录文本、视频帧及其他模态信号中精准提取相关证据,将复杂的全模态任务转化为以信息检索和结构化处理为核心的问题,而非直接摄入完整的音视频流。这一策略显著降低了对模型自身多模态感知能力的依赖,提升了推理效率与准确性。研究进一步通过失败归因分析与过程级追踪揭示了此类方法的局限性,并证明通过引入人类编写的或自蒸馏生成的技能(skill injection)可显著提升性能。为此,作者提出了开源训练方案 Code-X,包含 OmniCoding 轨迹数据集与可验证奖励机制,并在 Qwen-3.5-9B 与 Qwen-3.6-27B 模型上提供了基线结果。最后,论文指出未来发展的前沿方向是“多模态处理”(many-modality processing),并提出 TerminalBench-O——一个面向真实世界全模态任务的过程级评估基准,以推动该领域的发展。
链接: https://arxiv.org/abs/2606.00579
作者: Dongping Chen,Xuanao Huang,Zhihan Hu,Qingyuan Shi,Dianqi Li,Tianyi Zhou
机构: University of Maryland; MBZUAI(穆巴达拉人工智能研究所)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper under review
Abstract:As multimodal LLMs increasingly target video and audio, it is often assumed that such tasks require native omnimodal models. We show that this is not always the case: coding agents with only text+image access and a sandboxed tool-use interface can match, and in several settings outperform, SOTA native omnimodal models and predefined multimodal agent scaffolds across multiple audio-video benchmarks. Our trajectory analysis suggests that their strength comes from writing code and orchestrating tools to extract relevant evidence from transcripts, frames, and other modality signals, thereby converting omnimodal tasks into retrieval and information-processing problems rather than ingesting entire media streams. We further characterize their limitations through a failure taxonomy and process-level trace analysis, and show that simple skill injection, including human-written and self-distilled skills, substantially improves performance. To explore open-source elicitation, we introduce Code-X, a training recipe with the OmniCoding trajectory dataset and verifiable reward, and provide baselines on Qwen-3.5-9B and Qwen-3.6-27B. Finally, we argue that the next frontier is many-modality processing, and introduce TerminalBench-O, a process-level benchmark for real-world omnimodal processing tasks. Code will be available at this https URL.
[NLP-198] Revisiting Parameter-Based Knowledge Editing in Large Language Models : Theoretical Limits and Empirical Evidence ICML2026
【速读】: 该论文旨在解决参数化知识编辑(parameter-based knowledge editing)在实际应用中引发的模型性能退化问题,特别是其对大语言模型(LLM)核心推理能力的潜在破坏。现有方法普遍忽视了参数局部修改可能引发的理论局限性,且缺乏在贴近真实应用场景下的系统性评估。论文提出基于“维度坍缩假说”(Dimensional Collapse Hypothesis)的理论分析,揭示局部参数更新会沿着表示空间中脆弱的方向传播,导致全局干扰并最终引发推理崩溃。在此基础上,研究通过系统性实验,考察知识复杂度、编辑数量、评估维度及基线方法等变量的影响,结果表明:所有参数化编辑方法均显著损害模型的核心能力;而一种简单的检索增强基线方法在所有条件下均优于所有参数编辑方法。因此,该研究的关键解决方案在于强调:未来知识编辑研究必须将保持大语言模型固有推理能力作为核心目标,而非仅关注知识更新的准确性。
链接: https://arxiv.org/abs/2606.00570
作者: Wanying Ren,Xin Song,Futing Wang,Guoxiu He,Aixin Sun
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026. Equal contribution by the first two authors. 9 pages main paper, 10 figures, with appendix
Abstract:Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.
[NLP-199] Same Payload Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models EMNLP2026
【速读】: 该论文旨在解决生成式 AI(Generative AI)在具备代理能力(agentic roles)时,其安全风险随攻击内容输入渠道不同而呈现显著差异的问题。随着语言模型开始调用外部 API、读取工具输出并执行第三方内容中的指令,攻击面已从用户直接输入扩展至工具元数据与工具输出等间接通道。研究发现,当前模型对恶意指令的敏感性并非恒定,而是高度依赖内容的传递路径——即来自用户消息、工具描述或工具输出的相同恶意文本,可能引发不同的安全响应。为此,作者提出安全不对称评分(Safety Asymmetry Score, SAS),通过匹配的恶意载荷对,仅改变内容交付上下文,系统评估模型在不同输入通道下的脆弱性差异。实验覆盖六款生产级大语言模型(LLM)和三类攻击类型,结果显示:代理型模型在恶意内容通过工具描述传递时显著更易受攻击,而通用模型则相反;当内容由工具输出传递时,这种不对称性进一步反转,表明模型隐式将工具元数据视为可信指令,而将工具输出视为普通数据。对 Llama 3.3 70B 的机制分析揭示,与安全相关的表征存在于网络中后段深度,且以非线性方式编码,解释了为何传统线性探测方法失效。该研究揭示了当前工具使用模型中存在一种系统性、渠道依赖的安全盲区,其关键在于模型对不同输入来源的信任程度不一致,导致安全防护策略无法统一有效。
链接: https://arxiv.org/abs/2606.00566
作者: Mohammed Sameer Syed,Rozhin Yasaei(University of Arizona)
机构: University of Arizona
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 13 pages, 1 figure. Submitted to EMNLP 2026
Abstract:As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model treats a malicious instruction the same way regardless of where it arrives has not been systematically studied. We introduce the Safety Asymmetry Score (SAS), which measures how much a model’s susceptibility to adversarial content shifts depending on whether that content arrives in the user message, tool metadata, or tool output, using matched payload pairs that keep the malicious text identical and vary only the context of delivery. Evaluated across 6 production LLMs and three attack families, we find a consistent and informative asymmetry: agent-native models are substantially more vulnerable when adversarial content arrives via tool descriptions than via user messages, while general-purpose models show the reverse. This asymmetry further inverts when the same content is delivered through tool outputs rather than descriptions, suggesting models implicitly treat tool metadata as trusted instructions and tool results as ordinary data. A mechanistic study on Llama 3.3 70B reveals that the safety-relevant representation is causally present at mid-to-late network depths but non-linearly encoded, explaining why linear probes fail to detect it. These findings expose a systematic, channel-dependent blind spot in how current tool-using models handle adversarial content.
[NLP-200] Decomposed On-Policy Distillation for Vision-Language Reasoning : Steering Gradients for Visual Grounding ICML2026
【速读】: 该论文旨在解决在多模态领域中,基于策略的蒸馏(on-policy distillation)方法在训练小型推理模型时存在的优化动态不明确问题。现有方法通常采用单一整体目标函数进行联合优化,但其内在机制未被充分理解。研究发现,视觉-语言模型(VLM)蒸馏中的损失可数学分解为语言先验(language prior)与视觉定位(visual grounding)两个独立成分,且二者对应的梯度向量近乎正交,表明语言分布对齐与视觉感知匹配在几何上相互独立。因此,标准优化过程被动地沿着一个次优折衷路径前进,难以有效提升关键的视觉定位能力。针对这一瓶颈,本文提出视觉梯度引导(Visual Gradient Steering, VGS)方法,通过动态调整更新方向,主动强化对视觉子空间的优化,从而优先提升模型的视觉定位性能。实验结果表明,VGS在多种蒸馏设置和复杂多模态基准测试中均显著优于传统单体式蒸馏方法,在几乎无额外训练开销的前提下实现了更强的视觉-语言对齐能力。
链接: https://arxiv.org/abs/2606.00564
作者: Hee Suk Yoon,Eunseop Yoon,Jaehyun Jang,SooHwan Eom,Ji Woo Hong,Mark Hasegawa-Johnson,Qi Dai,Chong Luo,Chang D. Yoo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: ICML 2026 Spotlight
Abstract:While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher’s language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
[NLP-201] Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents
【速读】: 该论文旨在解决交互式文本转SQL(text-to-SQL)智能体在多轮交互中长期记忆(long-term memory)利用效率低下的问题。现有记忆检索方法存在两大局限:静态方法依赖固定相似性启发式规则,无法优化下游任务效用;动态方法通常仅基于稀疏的最终结果进行学习,并在单一决策时点检索记忆,难以适应记忆价值随交互阶段变化的特性——例如,初始规划阶段所需的全局策略记忆与执行阶段所需的状态敏感局部决策记忆具有显著差异。为此,论文提出MERIT,一种动态多时域(multi-horizon)记忆检索框架,通过分层设计实现记忆的精细化利用:在任务层面维护全局战略引导的记忆,在对话回合层面构建局部决策支持的记忆,并分别采用强化学习训练的检索策略以优化其使用。为缓解中间阶段监督信号匮乏的问题,MERIT引入轻量级过程奖励模型(Process Reward Model),提供密集的代理奖励以指导回合级记忆选择。在BIRD-Interact上的实验表明,MERIT在成功率和平均交互轮次上均优于无记忆、静态检索及现有动态检索基线;在Spider2-Snow上的迁移实验进一步验证了其跨基准泛化能力,无需特定基准调优即可取得正向效果。研究结果表明,多时域记忆检索机制显著提升了交互式text-to-SQL智能体对过往经验的复用能力。
链接: https://arxiv.org/abs/2606.00547
作者: Yibo Wang,Nikki Lijing Kuang,Philip S. Yu,Zhewei Yao,Yuxiong He
机构: University of Illinois Chicago (伊利诺伊大学芝加哥分校); Snowflake AI Research (Snowflake人工智能研究)
类目: Computation and Language (cs.CL)
备注:
Abstract:Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.
[NLP-202] Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization
【速读】: 该论文旨在解决现代语言模型微调中因每个提示(prompt)仅配对单一响应而导致的“模式彩票”(mode lottery)问题,即训练过程随机强调少数合理输出模式,而忽略其他有效模式,从而削弱了模型对多模态输出分布的建模能力。其核心解决方案是提出多响应微调(Multi-Response Training, MRT),通过保留每个提示的多个有效响应,以更完整地捕捉条件输出分布。关键洞察在于:提示与响应在统计上具有不同作用——新增提示降低输入分布的不确定性,而新增响应则降低条件输出分布的不确定性。由此引出一个方差预算权衡(variance-budget tradeoff)机制,可预测在何种情况下保留多响应具有价值:当提示冗余度较低、响应多样性较高时,多响应带来的信息增益显著;反之,当提示层面的不确定性占主导时,边际收益递减。研究进一步揭示,基于奖励的选择易引发模式坍缩(mode collapse),而随机选择K个响应中的任意一组(Random-K-of-N)是无偏的默认策略;此外,提出一种具有理论保障的子模质量-多样性目标函数,作为高效且稳健的替代方案。实验验证了预测的方差效应与选择偏差,包括奖励驱动选择导致梯度与真实目标不一致的严重失效模式。在结构化及真实数据集上的结果表明,MRT显著提升了分布外泛化性能,尤其在高响应多样性、低提示冗余的场景下优势突出。最终,该工作将响应多样性视为一种数据分配策略,证明在响应成本低廉且多样时,保留多个响应并非启发式手段,而是具有统计基础的最优选择。
链接: https://arxiv.org/abs/2606.00544
作者: Hasan Amin,Kian Ahrabian,Ming Yin,Rajiv Khanna
机构: Purdue University (普渡大学); University of Southern California (南加州大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the “mode lottery,” where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.
[NLP-203] ProactiveLLM : Learning Active Interaction for Streaming Large Language Models ICML2026
【速读】: 该论文旨在解决标准大语言模型(Large Language Models, LLMs)在处理流式输入时因采用“读取后生成”范式而导致的延迟高、计算冗余问题。尽管流式LLMs通过边接收输入边生成的方式缓解了部分延迟,但其仍面临交互时机难以动态决策的挑战:现有方法或依赖硬编码的交互时间点,或需昂贵的外部对齐信号(如时间标签、推理轨迹或强教师模型),限制了实际应用的灵活性与可扩展性。为此,本文提出ProactiveLLM,其核心创新在于利用模型自身的内生状态(endogenous states)实现主动交互决策。关键解决方案包括两种互补的训练机制:一是基于掩码的流式建模(mask-based streaming modeling),通过对输入施加单调随机掩码模拟渐进式流式输入,使模型从局部输入视图中学习语义依赖;二是同步特权自蒸馏(Synchronized Privileged Self-Distillation, SPSD),将同一演化模型生成的不完整上下文学生视图与全上下文教师视图对齐,利用特权全上下文信息指导学生在不完备观测下的理解。这两种机制共同诱导出无需外部教师或标注的内生充分性线索(endogenous sufficiency cues),为插件式集成多种决策头提供了通用基础。在文本与语音流式任务上的大量实验表明,ProactiveLLM显著降低了交互延迟,同时保持高质量输出,验证了其在动态、主动交互方面的强大能力。
链接: https://arxiv.org/abs/2606.00523
作者: Junlong Tong,Yao Zhang,Anhao Zhao,Yingqi Fan,Yunpu Ma,Xiaoyu Shen
机构: 未知
类目: Computation and Language (cs.CL)
备注: ICML 2026
Abstract:Standard Large Language Models (LLMs) follow a read-then-generate paradigm, causing unnecessary latency and computation. Streaming LLMs alleviate this issue by generating while receiving inputs, but still struggle to decide when to interact with the stream. Existing methods either hard-code interaction timing or rely on costly external alignment signals, such as timing labels, reasoning trajectories, or stronger teachers. In this paper, we propose ProactiveLLM, which achieves active interaction by leveraging the model’s endogenous states to guide interaction decisions. The model first learns to perceive semantic sufficiency from partial inputs through two complementary training mechanisms: mask-based streaming modeling and synchronized privileged self-distillation (SPSD). The former applies monotonic random masking to the input during training, simulating progressively revealed streaming inputs and enabling the model to learn local semantic dependencies from partial-input views. The latter aligns the partial-context student view with a full-context teacher view generated by the same evolving model, allowing privileged full-context evidence to guide the student’s understanding under incomplete observations. Together, these mechanisms induce endogenous sufficiency cues without requiring external teachers or annotations, providing a versatile foundation for the plug-and-play integration of diverse decision heads. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction. Code is publicly available at this https URL.
[NLP-204] Skill or Skip? Learning Selective Skill Invocation in Agent ic Tasks via Dual-Granularity Preference Learning
【速读】: 该论文旨在解决生成式智能体在执行复杂任务时,因盲目调用不相关技能而导致上下文干扰和执行流程中断的问题。现有方法多聚焦于技能选择或技能自身优化,却忽视了在当前决策点是否应实际调用某一相关技能这一关键问题。为应对这一挑战,论文提出SelSkill——一种双粒度偏好学习框架,其核心在于将技能调用建模为“调用或跳过”的决策问题,利用预测不确定性识别高优先级的决策点,并基于共享轨迹前缀构建受控的“调用-跳过”偏好对。该框架进一步融合了基于回合的最终结果偏好与基于步骤的调用偏好,从而同时捕捉整体轨迹质量与技能调用的局部有效性。实验结果显示,在ALFWorld(Qwen3-8B)上,任务成功率提升10.9个百分点,执行精度提升29.1个百分点;在BFCL上,任务成功率提升5.7个百分点,执行精度提升29.5个百分点。零样本测试在Tau-bench和PopQA上的表现表明,所学习的调用策略具备跨领域迁移能力,可有效应用于包含未见过技能的新场景。
链接: https://arxiv.org/abs/2606.00510
作者: Chishui Chen,Jiaye Lin,Te Sun,Junxi Wang,Yi Yang,Cong Qin,Yangen Hu,Lu Pan,Ke Zeng
机构: Meituan(美团); Fudan University(复旦大学); Shanghai Jiao Tong University(上海交通大学); Nanjing University(南京大学); Peking University(北京大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 18 pages, 4 figures, 10 tables
Abstract:Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.
[NLP-205] LaSR: Context-Aware Speech Recognition via Latent Reasoning
【速读】: 该论文旨在解决当前语音大语言模型(Speech LLMs)在上下文感知能力上的局限性,即其在语音识别过程中难以有效捕捉说话者意图与话题上下文,导致对专业术语等关键信息的识别准确率不足。其解决方案的核心在于提出一种名为LaSR(Latent Speech Reasoning,隐式语音推理)的新型训练范式,通过引入基于声学特征区域的上下文感知推理轨迹,将思维链(Chain-of-Thought, CoT)监督对齐至目标词汇的声学特征区间,并设计隐式推理阶段以实现上下文信息的锚定与转录过渡。该方法不依赖显式的中间生成令牌,从而在不增加延迟的前提下显著提升了专业术语的识别性能。为验证该方法在特定领域词汇上的表现,研究还构建了大规模语料库Spoken Darwin-Science,专注于学术术语的语音理解。实验结果表明,LaSR在Fun-Audio-Chat基准上优于标准监督微调基线,验证了隐式推理在构建高效、上下文感知型语音助手中的潜力。
链接: https://arxiv.org/abs/2606.00507
作者: Heyang Liu,Ziyang Cheng,Jiayi Huang,Wenyang Xiao,Ronghua Wu,Qunshan Gu,Yanfeng Wang,Yu Wang
机构: Shanghai Jiao Tong University (上海交通大学); Ant Group (蚂蚁集团)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, struggling to perform speech recognition that effectively reflects the speaker’s intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminologies. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.
[NLP-206] “I Strongly Suspect This Website Is a Scam”: Benchmarking PII Leakage and Detection without Defense in Autonomous Web Agents
【速读】: 该论文旨在解决生成式智能体(Generative AI agents)在面对社会工程学攻击(social-engineering attacks)时,极易泄露敏感个人身份信息(critical-tier PII)的问题。此类攻击通过伪装成合法网页环境,诱导自主代理错误提交用户隐私数据,对部署的智能体系统构成严重威胁。其解决方案的关键在于揭示现有防御机制的根本缺陷:当前依赖于智能体自身识别攻击信号(如推理过程中的怀疑判断)的防护策略存在“检测—行动鸿沟”(detection–action gap),即即便智能体已通过独立大模型判断站点可疑,仍有高达35.9%的会话仍提交关键信息,远高于无怀疑判断时的66.1%,表明个体认知不足以保障行为安全。因此,论文提出应放弃依赖内部推理信号的防御范式,转而采用输出层拦截机制——在不依赖智能体自身判断的前提下,对即将发出的数据提交进行外部干预,从而实现更可靠、更鲁棒的隐私保护。
链接: https://arxiv.org/abs/2606.00497
作者: Soham Roy,Sarthakbrata Halder,Arya Bharaty,Vaibhav Bhaskar,Yash Sinha,Dhruv Kumar,Srikant Panda,Murari Mandal
机构: KIIT Bhubaneshwar(基伊特布班什瓦尔); BITS Pilani(比茨皮拉尼); Lam Research(兰研究)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注: 24 pages
Abstract:Deceptive web content, widely instantiated across the internet and commonly known as \textitsocial-engineering attacks, manipulates autonomous web agents into submitting users’ personally identifiable information (PII) to attacker-controlled endpoints. In this paper, we show that social-engineering attacks are highly effective at extracting critical-tier PII from frontier web agents, posing a severe risk to deployed agentic systems. To quantify this risk, we introduce \textbf\textscScammer4U, a pre-registered benchmark of 91 attacker-controlled environments and 10 benign-twin baselines, spanning 8 attack vectors and 16 site categories on an 8-axis factorial taxonomy that isolates the causal contribution of individual attack design factors. Across frontier agents, we find that critical-tier PII leakage reaches 54–93% under no privacy guidance, compared to 0% on benign-twin baselines, confirming that leakage is attack-attributable rather than incidental form-filling. Escalating prompt-level mitigation yields sharply model-dependent reductions across the four families and remains insufficient to reliably prevent critical PII submission at the pooled level. Most critically, we identify a detection–action gap: agents whose reasoning an independent LLM judge confirms has flagged the site as suspicious still submit critical PII in 35.9% of sessions, versus 66.1% when no suspicion is verbalized, a 30.2% gap robust across all four model families. Our findings reveal that defenses conditioned on the agent’s own recognition of an attack are gating on the wrong signal, motivating output-level interception of outbound submissions that operates independently of the agent’s reasoning loop.
[NLP-207] Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs ICML2026
【速读】: 该论文旨在解决统一多模态模型(Unified Multimodal Models, UMMs)在实际应用中如何有效更新内部知识的问题,特别是关注文本知识编辑是否能够可靠地跨模态迁移至图像生成任务。其核心挑战在于:尽管文本侧的知识编辑已相对成熟,但现有方法在修改文本输出时所实现的编辑效果,并不能保证同样有效地影响图像生成结果,存在显著的模态间差距。解决方案的关键是提出一种基于推理增强的参数编辑方法(Reasoning-augmented Parameter Editing),通过在生成前显式激活已编辑的知识,增强文本表示与视觉生成条件路径之间的对齐性,从而显著提升跨模态知识编辑的有效性。实验表明,该方法可使整体视觉问答(VQA)准确率最高提升18.6个百分点,且机制分析揭示了编辑效果不佳的根本原因在于文本表示与图像生成条件路径之间存在部分对齐偏差。研究结果表明,单纯依赖文本编辑无法确保可靠的跨模态知识转移,亟需发展具有模态感知能力的编辑方法。
链接: https://arxiv.org/abs/2606.00477
作者: Xin Gao,Cheng Yang,Chufan Shi,Taylor Berg-Kirkpatrick
机构: 未知
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Published at ICML 2026; Code and data available at this https URL
Abstract:Unified multimodal models (UMMs) have emerged as a promising paradigm for general-purpose multimodal intelligence. As they are deployed in real-world applications, effectively updating internal knowledge becomes critical. While knowledge editing has matured for text-only models, it remains unclear whether edits that successfully modify textual outputs also transfer to image generation in UMMs. To study this question, we introduce UniKE, the first benchmark for cross-modality knowledge editing in UMMs, comprising 2,971 edit subjects spanning attribute and relation edits. Using VQA-based visual verification, we reveal a striking modality gap: text-side efficacy can reach approximately 92%, whereas the best overall VQA accuracy under direct image generation is only 18.5%. We further propose Reasoning-augmented Parameter Editing, which explicitly activates edited knowledge before generation and improves overall VQA accuracy for all evaluated model-editor pairs, with gains up to 18.6 percentage points. Mechanistic analysis shows that this gap is associated with partial alignment between edited textual representations and the conditioning pathways for visual generation, where edits sufficient for text outputs may remain too weak or misaligned to steer image synthesis. These findings show that textual knowledge edits do not guarantee reliable cross-modality transfer and motivate modality-aware editing methods. Our code and data are available at this https URL.
[NLP-208] On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance ICML2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在零样本标注(zero-shot annotation)及“模型作为裁判”(LLM-as-a-judge)任务中可靠性不足的问题,核心在于探究模型内部先验知识与用户指令之间交互的三个关键维度:(1)模型对数据和任务定义的熟悉程度如何影响性能;(2)提示词中附加信息能否有效纠正零样本错误(即“决策固着”现象);(3)模型对任务定义不一致的敏感性。研究发现,近三分之二的零样本错误无法通过提示修正,整体修正率仅为34.8%,且高置信度错误尤为顽固;当任务定义存在偏差时,模型仍会遵循错误定义,同时保持与正确定义下相同的置信水平。论文提出“定义特定熟悉度”(Definition-Specific Familiarity, DSF),用于衡量模型内部概念与任务定义之间的对齐程度。控制数据集层面混杂因素后,DSF与模型性能呈显著正相关(偏相关系数 r = +0.41),而三种文本级记忆度量(ROUGE-L、BERTScore、嵌入余弦相似度)均未表现出正向关联。研究表明,基于提示的纠错策略在标注任务中存在根本局限,模型性能的关键决定因素并非文本层面的记忆能力,而是任务定义与模型内部认知的对齐程度。
链接: https://arxiv.org/abs/2606.00467
作者: Etienne Casanova,Rafal Kocielnik,R. Michael Alvarez
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted at ICML 2026 (Oral Spotlight); PMLR vol. 306. 9 pages, 4 figures
Abstract:Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM’s familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors (“decision stickiness”), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model’s internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
[NLP-209] Short-form Text Rewriting with Phi Silica
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在短文本重写(short-form text rewriting)任务中面临的语义保真度不足与幻觉鲁棒性差的问题。由于短文本具有上下文受限、语义密度高的特点,传统SLMs难以在保持原意的同时生成高质量的改写结果。其解决方案的关键在于通过数据集构建、提示蒸馏(prompt distillation)、参数高效微调(parameter-efficient fine-tuning)等策略,对Phi Silica这一SLM进行针对性适配。研究团队从公开幻灯片中构建了短篇展示型文本数据集,并利用GPT-5-chat生成重写监督信号及作为大模型评判基准(LLM-as-a-judge),最终实验证明,经过优化后的模型在语义保真度、幻觉抑制以及人类偏好得分上均显著优于原始模型,甚至在与GPT-5-chat重写结果的对比中取得更高胜率。研究表明,针对特定任务的精细化适配可有效缩小SLMs与云端大模型之间的性能差距,为高精度重写场景下SLMs的实际应用提供了可行路径。
链接: https://arxiv.org/abs/2606.00462
作者: Divya Tadimeti,Shawn Pan,Sameera Lanka,Chenghui Zhou,Sadid Hasan
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages
Abstract:Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.
[NLP-210] SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors
【速读】: 该论文旨在解决生成式语音大模型(Speech-aware large language models)在域外场景下泛化能力差的问题。其核心解决方案是提出一种轻量级适配方法SALSA(Speech-Aware LLM Adaptation via Learned Steering Activations),通过学习逐层的引导向量(steering vectors)来实现对模型的高效微调。与依赖对比激活差异的传统引导方法不同,SALSA直接采用监督目标优化引导向量,从而更精准地调整模型内部表示。在儿童语音、多语言语音以及中英混合语码转换等跨域基准测试中,SALSA显著优于零样本推理和语音上下文学习基线,最高实现46.8%的相对性能提升。进一步分析表明,对编码器(尤其是深层)进行引导比对语言模型主干(LLM backbone)的引导更为有效,说明该方法通过将高层声学与音位表征更好地对齐预训练语言模型的表示空间,从而提升下游自动语音识别(ASR)性能,而非通过修改解码器实现。
链接: https://arxiv.org/abs/2606.00460
作者: Yekaterina Yegorova,Argyrios Gerogiannis,Haolong Zheng,Julia Hockenmaier,Chang D. Yoo,Mark A. Hasegawa-Johnson
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄本那-香槟分校); Korea Advanced Institute of Science and Technology (韩国科学技术院)
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech-aware large language models often generalize poorly to out-of-domain settings. We propose SALSA (Speech-Aware LLM Adaptation via Learned Steering Activations), a lightweight adaptation method that learns layer-wise steering vectors. Unlike commonly used steering approaches that rely on contrastive activation differences, SALSA directly optimizes steering vectors using a supervised objective. Across children’s speech, multilingual speech, and Mandarin-English code-switching benchmarks, SALSA substantially improves performance over zero-shot inference and speech in-context learning baselines, achieving up to 46.8% relative improvements over zero-shot. Analysis further demonstrates that steering the encoder, particularly the later layers, is more effective than steering the LLM backbone. These findings suggest that steering improves downstream ASR performance by adapting higher-level acoustic and phonetic representations to better align with the pretrained language model representation space, rather than by modifying the decoder itself.
[NLP-211] ProtStructQA: A Denotation Threshold in Protein Structural Reasoning
【速读】: 该论文旨在解决蛋白质语言模型在面对结构化生物学问题时,其生成的自然语言回答是否真正对应可执行的三维结构度量这一关键挑战。传统评估仅关注文本的生物合理性,而忽略了语义与实际结构测量之间的精确映射。为此,论文提出ProtStructQA,一个可执行的蛋白质结构问答基准,其中每个自然语言问题均由隐藏的类型化领域特定语言(DSL)程序生成,答案通过在AlphaFold预测结构上执行该程序获得。该基准包含38.22万道问题,覆盖置信度、距离、预测对齐误差(PAE)、溶剂可及性、二级结构、拓扑和接触等多维度信息,并划分出33万道活跃基准题与5.22万道困难负样本鲁棒性测试集。实验表明,在未微调条件下,不同规模的Qwen3模型(0.6B至8B)在直接提示、思维链(chain-of-thought)、语法约束可执行投票、带思维链的可执行投票及多轮ReAct式工具使用等多种策略下表现各异。研究发现存在一个能力阈值:在Qwen3-1.7B与4B之间,低于该阈值时,模型难以生成可解析的可执行表达式,工具辅助的ReAct策略占优;高于该阈值后,思维链从多数有害转为显著有益,成为多数任务下的最优策略。解析失败分析与家族级分析进一步揭示,该阈值标志着从不可解析的语言表达向可执行结构语义映射的转变。此外,语法约束与执行机制对PAE和二级结构查询仍具选择性价值。综上,ProtStructQA将科学问答重构为从语言到可计算测量的编译过程,为评估大模型能否将词语准确映射为三维结构可执行度量提供了诊断性测试平台。
链接: https://arxiv.org/abs/2606.00451
作者: Aravind Mandiga,Guoming Li,Jin Lu,Ismailcem Budak Arpinar,Khaled Rasheed,Samuel E. Aggrey
机构: University of Georgia (佐治亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Protein-language systems are often evaluated by whether they generate plausible biological text, but a structural question has a sharper semantics: it denotes a measurement in a 3D coordinate system. We introduce ProtStructQA, an executable benchmark for protein structural question answering in which each natural-language question is generated from a hidden typed domain-specific language (DSL) program and the answer is obtained by executing that program on an AlphaFold-predicted structure. ProtStructQA releases 382.2K questions covering confidence, distances, predicted aligned error (PAE), solvent exposure, secondary structure, topology and contacts, and held-out compositions: a 330K active benchmark over 10K proteins from four species, plus a 52.2K hard-negative robustness pool. Without fine-tuning, we evaluate Qwen3 models from 0.6B to 8B under direct prompting, chain-of-thought, grammar-constrained executable voting, executable voting with chain-of-thought, and multi-turn ReAct-style tool use, and replicate the headline finding on Gemma-3-1B and Gemma-3-12B. We find a capability-dependent denotation threshold between Qwen3-1.7B and Qwen3-4B: below it, tool-mediated ReAct dominates because models often fail to produce executable denotations; above it, chain-of-thought flips from mostly harmful to strongly beneficial and becomes the strongest strategy on most splits. Parse-failure and family-level analyses show that the threshold is a transition from unparseable language to executable structural denotation, while grammar and execution remain selectively valuable for PAE and secondary-structure queries. ProtStructQA reframes scientific QA as compilation from language to measurement and provides a diagnostic testbed for when language models can map words to executable 3D structural measurements.
[NLP-212] Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters ICML2026
【速读】: 该论文旨在解决低秩适配器(Low-rank adapters)在参数预算分配上因离散的秩(rank)步长导致的容量增量粗糙问题。传统方法如LoRA以整数阶为单位增加可训练参数,例如在2048×2048的OPT注意力投影中,每增加一秩即引入4096个可训练标量,造成参数预算与性能之间的显著间隙。为此,论文提出采用固定组件的张量分解形式——标准多项式分解(Canonical Polyadic, CP)张量适配器,通过将适配器参数化为多个低秩成分的叠加,实现更细粒度的参数预算控制。在32×64×32×64的张量化结构下,每个归一化CP成分仅需193个可训练标量,相比LoRA单秩增量缩小了约21倍,从而显著细化了参数预算的调整粒度。实验在OPT-1.3B模型上针对SST-2、RTE和BoolQ三个任务,在匹配目标模块、训练协议、数据上限和随机种子调度的前提下,对比了CP适配器与LoRA的表现。结果表明,CP适配器训练稳定,有效填补了LoRA各秩之间的性能空白,但其性能提升具有任务依赖性:在SST-2上早期达到低预算平台期,BoolQ在额外添加CP成分后性能优于LoRA但略低于其饱和水平,而RTE仍保持对LoRA的偏好。因此,更精细的参数增量有助于更精确地诊断参数效率微调(PEFT)的预算敏感性,但并不必然带来整体更优的精度-预算权衡曲线。
链接: https://arxiv.org/abs/2606.00428
作者: Xinjue Wang,Xiuheng Wang,Yejun Zhang,Sergiy A. Vorobyov,Esa Ollila,Zhi-Yong Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at the ICML 2026 Workshop on CoLoRAI
Abstract:Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. For a 2048\times2048 OPT attention projection, increasing LoRA by one rank stores 4096 trainable scalars, leaving large gaps between feasible low-budget adapter sizes. This paper asks whether a tensorized adapter with finer capacity increments changes the observed accuracy–budget trade-off. We instantiate this question with fixed-component canonical polyadic (CP) tensor adapters. Under a 32\times64\times32\times64 tensorization, one normalized CP component stores 193 trainable scalars per projection, about 21 times smaller than one LoRA rank step. We compare CP adapters and LoRA on OPT-1.3B across SST-2, RTE, and BoolQ under matched target modules, training protocol, data caps, and seed schedules. CP trains stably and fills the gaps between LoRA ranks, but the effect is task-dependent: SST-2 reaches an early low-budget plateau, BoolQ benefits from additional CP components before saturating slightly below LoRA, and RTE remains LoRA-favored. Finer parameter steps are therefore useful for diagnosing PEFT budget sensitivity, but they do not by themselves guarantee a better accuracy–budget curve.
[NLP-213] he Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary ICML2026
【速读】: 该论文旨在解决生成式智能体在确定性状态追踪任务中,因采用扩展的思维链(Chain-of-Thought, CoT)推理而导致性能下降的问题。其核心挑战并非源于偏好偏差,而是由仅解码器架构(decoder-only)注意力机制所固有的信息论容量瓶颈所致。解决方案的关键在于揭示并量化这一根本限制:首先,提出“注意力瓶颈定理”(Attention Bottleneck Theorem),通过互补的可实现构造,将状态追踪能力上限界定为 $ O(H \cdot \log(L/H) \cdot \sqrt{d_h}) $,其中 $ H $ 为状态数,$ L $ 为上下文长度,$ d_h $ 为模型维度;其次,建立依赖于上下文的状态误差模型,揭示准确率呈现超指数级衰减特性;进一步引入“状态空间 Jaccard”(State-Space Jaccard)度量,以区分能力不足与偏好错误;最终,推导出确定性推理的有效极限 $ d^* \in [19, 31] ,超过此阈值则必须依赖工具调用。实验验证表明,在12个模型和8个任务领域(如SWE−Bench、WebArena、SQL−Multi)中,融合工具的混合推理方式显著优于纯神经思维链,准确率可达86–94 r = 0.81 - 0.91 $)表明失败本质源于架构而非训练差异。研究为智能体系统中何时应从纯神经推理转向混合范式提供了严谨的理论指导。
链接: https://arxiv.org/abs/2606.00376
作者: Dongxin Guo,Jikun Wu,Siu Ming Yiu
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at ICML 2026. 4 figures. 51 pages including appendices
Abstract:Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as O(H \cdot \log(L/H) \cdot \sqrtd_h) ; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon d^* \in [19, 31] beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields 5% improvement, confirming an architectural ceiling, and high cross-model correlation ( r = 0.81 - 0.91 ) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.
[NLP-214] How Far Do Auto-Interpretation Labels Generalize: A Controlled Study Across Languages Scripts and Rewordings
【速读】: 该论文旨在解决生成式语言模型中稀疏自编码器(Sparse Autoencoder, SAE)特征的可解释性问题,特别是其自动标注的自然语言标签在跨语言与跨书写系统时是否具备泛化能力。核心问题是:一个在某种语言中被标注为表征特定语义概念的特征,是否能在其他语言或书写系统中仍准确追踪该概念?研究以塞尔维亚语的双文字系统(拉丁字母与西里尔字母之间的确定性转写)作为受控实验环境,发现不同书写形式下激活的SAE特征集具有显著重叠(最高Jaccard相似度达0.57,远高于随机基线0.13),表明这些特征确实具备跨语言语义表征能力。然而,进一步测试显示,自动标注标签的准确性在塞尔维亚语中明显下降——尤其是对西里尔字母文本的误判率比英语高4倍以上,且西里尔字母的误判率高于拉丁字母,尽管二者为确定性转写关系。这一现象揭示了标签性能与训练数据中不同书写形式的代表性程度密切相关。随着网络深度增加,标签失效问题加剧,但标签本身并未提供任何失败预警。因此,研究指出当前自动标注标签可能反映的是特征在训练数据中常见输入上的行为模式,而非其所声称的深层语义概念本身,提示现有自动解释机制存在严重的语言与书写系统偏差。
链接: https://arxiv.org/abs/2606.00356
作者: Sripad Karne
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Sparse autoencoder (SAE) features are increasingly used to interpret language models, with auto-generated natural-language labels serving as the primary interface for understanding what each feature represents. We ask whether these labels generalize: does a feature labeled for a concept actually track that concept across languages and scripts? Using Serbian digraphia as a controlled testbed – the same language written in both Latin and Cyrillic via deterministic transliteration – we first find that SAE feature sets activated by the same content in different languages, scripts, and wordings share substantial overlap (peak Jaccard similarity 0.57 vs.\ 0.13 random baseline), suggesting genuine cross-lingual semantic features. We then test whether auto-interpretation labels keep pace. They often do not: features whose labels describe semantic content miss the same meaning in Serbian up to 4\times more often than within English, and miss Serbian Cyrillic more than Serbian Latin – two scripts that are deterministic transliterations of each other – suggesting the failures track how well each form is represented in training. The gap grows with network depth, yet the labels give no indication that they fail. These results suggest that auto-interpretation labels may reflect a feature’s behavior on well-represented inputs rather than the concept itself.
[NLP-215] Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在偏好学习(preference learning)阶段出现的语言使用偏差问题,尤其是由人类反馈强化学习(Reinforcement Learning from Human Feedback, RLHF)引发的词汇层面的系统性偏移。此类偏移表现为模型对特定表达格式或高频词汇(如“delve”、“furthermore”)的过度依赖,即使这些模式在基础模型输出中并不存在。现有研究受限于人工标注的高成本与主观性,难以高效、客观地识别此类偏差。为此,本文提出“三角化偏好迁移评分”(Triangulated Preference Shift score),通过融合人类黄金标准、基础模型与指令微调版本三者之间的对比,实现对偏好学习导致的行为变化的自动化量化分析,无需依赖人工标注。该方法在六个模型家族的数据上进行了验证,并揭示了偏好学习可能使模型趋向于一种可被解读为“精英语言”(language of prestige)的表达风格。该指标为评估和优化模型对齐性提供了首个自动化的量化工具,有助于提升生成式 AI 的可信度与可控性。
链接: https://arxiv.org/abs/2606.00334
作者: Xiaoyang Ming,Jose Hernandez,Thomas Stephan Juzek
机构: Florida State University (佛罗里达州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages, 2 figures, 1 table
Abstract:Various language domains have undergone remarkable changes in recent years; these shifts are largely attributed to the advent of Large Language Models and their misalignment with natural language usage. These misalignments are thought to partly originate in the preference-learning stage, e.g. Reinforcement Learning from Human Feedback, which generally makes models more useful but simultaneously may introduce systematic lexical bias. In terms of lexical behavior, this is visible in a model’s preference for certain formats or the overuse of words (delve, furthermore), even when such patterns are not present in base model outputs. Research on lexical misalignment induced during preference training is constrained by reliance on manual curation. We address this, by introducing the Triangulated Preference Shift score, a metric that triangulates between human gold standards, base models, and instruct variants to isolate shifts induced specifically by preference learning, without manual curation. We provide data across six model families, anchor the results in the literature, and illustrate the general approach’s utility by analyzing whether preference learning shifts models toward what could be interpreted as a “language of prestige”. The metric provides an initial automated method to quantify behavioral shifts attributable to preference tuning, and thus, may help inform model alignment and development of trustworthy AI.
[NLP-216] Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLM s
【速读】: 该论文旨在解决生成式 AI(Generative AI)在处理法律与行政事务类问题时,因用户输入语言与实际适用法域不一致而引发的制度框架误判问题。当用户未明确指定国家或地区时,模型可能默认以输入语言所对应的语言区作为法域依据,从而导致输出结果偏离用户实际所需的法律体系。其解决方案的关键在于揭示并量化这种“机构框架误选风险”(institutional-framework misselection risk):即多语言用户使用非目标法域语言提问时,模型仍倾向于依据输入语言自动推断法域,造成回答偏差。研究通过在中美两国开发的七款大模型上进行跨语言、跨系统提示条件的实证评估,发现中文输入更易触发中国法律框架回应,英文输入则更常导向美国或通用性答案;尤其在要求单一答案的场景下,74.5%的英文输入响应采用美国框架,53.3%的中文输入响应采用中国框架,且该趋势在所有模型中均显著存在。因此,论文主张:大模型接口不应仅依赖输入语言进行制度推理,而应在用户地理位置缺失时主动请求确认或明示回答所依据的法域范围,以提升法律建议的准确性与可信赖性。
链接: https://arxiv.org/abs/2606.00333
作者: Zhizhi Wang,Harini Suresh
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:LLMs increasingly answer questions about taxes, labor protections, healthcare, education, pensions, and administrative procedures, where usefulness often depends on the applicable jurisdiction. Multilingual users may write in their most comfortable language rather than one associated with the country or region whose rules apply. We ask whether deployed LLMs use input language as a default jurisdictional signal when prompts omit any country or region. Prior multilingual audits show that prompt language can shift cultural, political, or normative outputs; we examine which legal-administrative framework models supply when jurisdiction is underspecified. We evaluate seven LLMs developed in the United States or China on 60 underspecified legal-administrative prompts in English and Mandarin Chinese under three system-prompt conditions, yielding 2,520 manually annotated responses. Across models and conditions, Chinese input more often produces China-specific answers, while English input more often produces U.S.-specific, comparative, or generic answers. Prompts requiring a single answer further increase jurisdiction selection: pooled across models, 74.5% of English-input responses adopt a U.S. framework, while 53.3% of Chinese-input responses adopt a China framework. This directional pattern appears in all seven models. We describe this deployment-level pattern as institutional-framework misselection risk: a fluent answer may rely on a legal-administrative context the user did not intend, especially when their preferred language differs from the relevant jurisdiction. LLM interfaces should not route institutional advice by input language alone; when location is absent, they should request it or state the jurisdictional scope of the answer.
[NLP-217] Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance
【速读】: 该论文旨在解决生成式 AI(Generative AI)中基于策略的蒸馏(On-Policy Distillation, OPD)方法在提升大语言模型推理能力时存在的根本性局限:尽管OPD采用教师监督下的自策略采样轨迹进行训练,其学习信号仍局限于逐标记(token-level)层面,无法有效识别和修复真正导致推理路径偏离的深层分歧状态。其关键问题在于,高损失标记中约30%位于低分歧区域,多为表面形式不匹配而非实质性推理分支,且孤立的标记级逆KL修正难以应对短时程分布漂移引发的推理失败。为此,论文提出轨迹感知的OPD(Trajectory-aware OPD, TOPD),通过引入近未来轨迹信息来识别真实的分歧状态,并将指导信号跨多个未来标记进行分布,从而实现更精准的纠错与路径对齐。实验表明,TOPD显著优于标准OPD,平均准确率由47.8%提升至52.2%,在AIME24和AIME25上分别达到63.3%和53.3%的性能提升,验证了轨迹级上下文感知在推理优化中的关键作用。
链接: https://arxiv.org/abs/2606.00305
作者: Yuxuan Jiang,Francis Ferraro
机构: University of Maryland, Baltimore County
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:On-Policy Distillation (OPD) improves large language model reasoning by training a student model on trajectories sampled from its own policy under teacher supervision. Although OPD operates on trajectories, its learning signal remains token-level: it identifies deviations through high-loss tokens and repairs them through local reverse-KL correction. We show that this “trajectory-sampled but token-learned” mechanism cannot reliably bridge student trajectories toward teacher trajectories. About 30% of high-loss tokens fall into the low-divergence regime, indicating that many are surface-form mismatches rather than real reasoning forks. Moreover, even truly divergent tokens are difficult to repair with isolated token-level supervision, since reasoning failures often unfold as short-horizon distributional drift. We propose Trajectory-aware OPD (TOPD), which uses near-future trajectory information to identify real divergent states and distribute guidance across multiple future tokens. Experiments show that suppressing non-divergent high-loss tokens improves standard OPD from 47.8% to 48.2% average accuracy, while TOPD further improves performance to 52.2%, with gains on AIME24 from 60.0% to 63.3% and AIME25 from 46.7% to 53.3%.
[NLP-218] Uncovering Temporal Framing in the News ACL2026
【速读】: 该论文旨在解决新闻话语中时间语言的修辞功能问题,即如何通过过去、现在和未来的时间表述构建意义并实现说服效果,而不仅仅是描述事件的时间顺序。其核心挑战在于识别和分析“时间框架”(temporal framing)——一种基于时间相关语言进行意义建构而非单纯记录时间序列的修辞策略。解决方案的关键在于提出一个基于前人时态与框架理论的八类时间框架分类体系,并通过专家标注构建了一个多语言新闻语料库,涵盖458篇英德文新闻文章,包含超过2000个具有时间框架特征的句子及约3000个标注。研究进一步通过监督微调和零样本分类方法评估了时间框架的可检测性,结果表明在句子层面,时间框架具备可学习性,且监督模型显著优于零样本方法。该工作为后续研究提供了公开可用的高质量语料库,推动了对时间修辞机制的深入探索。
链接: https://arxiv.org/abs/2606.00294
作者: Tarek Mahmoud,Veronika Solopova,Premtim Sahitaj,Ariana Sahitaj,Max Upravitelev,Mervat Abassy,Hana Fatima Shaikh,Neda Foroutan,Vera Schmitt,Preslav Nakov
机构: 未知
类目: Computation and Language (cs.CL)
备注: ACL 2026 Main Conference Oral
Abstract:Temporal language does more than place events on a timeline. In news discourse, references to the past, present, and future can function as rhetorical devices that shape interpretation and persuasion. Here, we study temporal framing, defined as the persuasive use of time-related language to structure meaning rather than to report chronology. We propose a taxonomy of eight temporal frames grounded in prior work on temporality and framing, and we realize it through expert annotation of a multilingual news corpus. The resulting dataset includes 458 English and German news articles, with over 2K temporally framed sentences and approximately 3K temporal framing annotations identified from a corpus of more than 20K sentences. We analyze frame prevalence, co-occurrence patterns, and lexical cues, and evaluate temporal framing detection using supervised fine-tuning and zero-shot classification. Our experiments show that temporal framing is learnable at the sentence level, with supervised models substantially outperforming zero-shot approaches. We publicly release the corpus to support future research on temporal framing: this https URL.
[NLP-219] Model-Based Quality Assessment for Massively Multilingual Parallel Data
【速读】: 该论文旨在解决大规模多语言双语语料库中存在的两大核心问题:非平行句对(non-parallel sentence pairs)以及低质量翻译(low-quality translations)。其解决方案的关键在于将基于模型的语料评估分解为两个独立但互补的组件:基于多语言嵌入的平行性评估(parallelism assessment)与无参考质量估计(reference-free quality estimation, QE)。研究通过在FLORES-200和BOUQuET检索任务上基准测试四种多语言嵌入模型,覆盖6,654个源-目标语言对;同时在41,412个有序语言对上评估九种无参考评价器,发现单一模型在所有语言方向上均不具备普遍可靠性。此外,简单的QE集成会稀释强模型信号,而目标语言覆盖率与更高QE得分显著相关。研究表明,多语言平行数据评估应被视为一种面向具体语言对的方向感知(direction-aware)路由与校准问题,即不存在适用于所有语言对的通用度量标准。
链接: https://arxiv.org/abs/2606.00285
作者: Abdelaziz M.A. Ibrahim,Zihao Li,Jörg Tiedemann,Shaoxiong Ji
机构: University of Jyväskylä; University of Helsinki; ELLIS Institute Finland; University of Turku
类目: Computation and Language (cs.CL)
备注:
Abstract:Large-scale multilingual bitext often contains two distinct problems: non-parallel sentence pairs and low-quality translations. We decompose model-based assessment for such data into two independent components: parallelism assessment with multilingual embeddings and reference-free quality estimation (QE). For parallelism, we benchmark four embedding models on FLORES-200 and BOUQuET retrieval tasks, covering 6,654 source–target directions in our target language-pair inventory. For QE, we evaluate nine reference-free evaluators on professional FLORES-200 translations across 41,412 ordered source–target directions. Results show that no model is universally reliable across translation directions. Naive QE ensembles dilute strong model signals, while documented target-language coverage is strongly associated with higher QE scores. Overall, these findings suggest that multilingual parallel-data assessment is best approached as a direction-aware routing and calibration problem, where no single universal metric is expected to suffice across all languages.
[NLP-220] Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models
【速读】: 该论文旨在解决持续预训练(Continual Pretraining, CPT)过程中多语言场景下模型因灾难性遗忘(catastrophic forgetting)导致已有通用知识退化的问题。尽管以语系为单位组织训练可减少跨语言干扰,但仍无法有效防止对下游任务至关重要的通用知识丢失。研究将此现象归因于多语言CPT中参数漂移(parameter drift),并提出五种分层感知的参数对齐策略:硬层冻结、软正则化、事后权重回滚及模型融合。通过在涵盖五个语系共32种语言及其未见语言的基准测试上,从困惑度、阅读理解、物理推理和翻译四个维度系统评估,发现参数对齐显著降低了遗忘程度,且对语言习得影响极小:其中硬层冻结与软正则化最有利于保持阅读理解能力,而事后权重回滚在翻译任务上表现最优。研究成果揭示了语系专精型持续预训练中的“习得-遗忘”边界,并为不同任务匹配最优策略提供了实用部署指南。
链接: https://arxiv.org/abs/2606.00284
作者: Sanchit Ahuja,Terra Blevins
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 25 Pages, 5 Figures
Abstract:While continual pretraining~(CPT) is a practical way to extend large language models to new languages, naïve finetuning on targeted data erodes existing capabilities through catastrophic forgetting. Organizing training around language families reduces cross-language interference but cannot alone prevent forgetting of the general knowledge needed for downstream tasks. We link this forgetting to parameter drift in multilingual CPT and present a suite of five layer-aware parameter alignment strategies: hard layer freezing, soft regularization, post-hoc weight reversion, and model merging. We systematically compare our alignment strategies against two unregularized CPT baselines on benchmarks spanning 32 training languages from five language families, plus held-out languages, across four evaluation axes: perplexity, reading comprehension, physical reasoning, and translation. Parameter alignment substantially reduces forgetting at minimal cost to language acquisition: layer freezing and regularization best preserve comprehension, whereas post-hoc reversion yields the strongest translation gains. Together, these results map the acquisition–forgetting frontier for family-expert CPT and offer practical deployment guidelines pairing each strategy to the tasks it best serves.
[NLP-221] On Wednesdays We Ask Questions: Optimizing “Active Listening” in Automated Legal Triage and Referral
【速读】: 该论文旨在解决法律援助申请系统中如何有效生成高质量、可理解的跟进问题,以精准识别申请人法律问题核心匹配项的问题。其解决方案的关键在于通过引入一个低成本的大型语言模型(LLM)集成框架(FETCH分类器),自动生成有助于细化法律问题分类的自然语言提问。研究发现,尽管低成本模型在分类任务上表现良好,但在生成符合法律咨询场景、语义清晰且能有效引导用户披露关键事实的跟进问题方面存在局限;仅依赖提示工程(prompt engineering)不足以提升问题质量。进一步分析表明,基于大模型的“模型作为裁判”(LLM-as-judge)评分与人类专家评分存在显著差异,凸显了人工判断的重要性。最终,研究证明引入单一高成本模型(如GPT-5)可显著提升问题的相关性与信息获取效率,并改善分类准确率。此外,研究揭示不同法律类别(如家庭暴力)在事实采集上的不均衡现象,与现行家事法筛查协议不符,因此提出在特定法律领域应设立专门的筛查小组以保障筛查质量。
链接: https://arxiv.org/abs/2606.00272
作者: Quinten Steenhuis,Jacqueline Harvey
机构: Suffolk University Law School (萨福克大学法学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Working paper submitted as accepted to AIDA2J workshop at International Conference for AI and Law in Singapore, June 2026
Abstract:The FETCH classifier generates follow-up questions to help refine the best match for the applicant’s legal problem, using a low-cost ensemble of LLMs. In this paper, we describe an expert attorney and LLM-assisted evaluation of the follow-up question approach in FETCH and show that while low-cost LLMs perform well at classification tasks, generating high-quality plain-language questions in this setting appears to require a more sophisticated and higher-cost model. Through discussion with legal intake workers, we propose a rubric for the evaluation of legal intake classification questions, and we find that prompt engineering alone is not enough to improve question quality for intake purposes. We also find that LLM-as-judge and human ratings diverge. We demonstrate that with the addition of a single high-cost model, GPT-5, the classifier can elicit relevant information from applicants for legal help, and that the questions lead to more accurate performance at classification tasks. We also find uneven fact elicitation across different categories, including domestic violence, at odds with family law screening protocols, suggesting the value of including dedicated screening panels for certain areas of law.
[NLP-222] DeSQ: Decomposition-based SPARQL Query Generation
【速读】: 该论文旨在解决知识库问答(KBQA)领域中两类主流方法的固有缺陷:一是生成形式化查询的方法存在脆弱性且可解释性差,二是通过知识库(KB)探索直接检索答案的方法计算开销大且易产生幻觉。为此,论文提出一种无特定知识库依赖的框架DeSQ(基于分解的SPARQL查询生成),其核心解决方案在于三阶段流程:首先将复杂问题分解为反映底层知识库关系结构的原子约束(Atomic Constraints, ACs);其次生成两部分结构化输出——每项原子约束对应的标准SPARQL片段(使用标准化变量与统一资源标识符(URI)占位符)以及占位符的实体绑定(URIs Grounding)块;最后将各片段组装成完整的SPARQL查询。该方法在五个主流基准中的四个上超越现有最先进模型,并对词汇变化表现出更强鲁棒性。此外,由于采用结构化输出,该框架无需依赖实时知识库端点即可进行评估,同时支持细粒度错误分析,便于针对性优化。
链接: https://arxiv.org/abs/2606.00203
作者: Papa Abdou Karim Karou Diallo,Aditya Sharma,Neshat Elhami Fard,Amal Zouaq
机构: LAMA-WeST; Mila – Quebec AI Institute; Polytechnique Montréal
类目: Computation and Language (cs.CL)
备注:
Abstract:Dominant approaches to Knowledge Base Question Answering (KBQA) fall into two categories. First is the generation of a formal query that suffers from brittleness and limited explainability, and the second is direct answer retrieval through KB exploration that is computationally costly and prone to hallucination. To combine the strengths of both paradigms while mitigating their respective weaknesses, we introduce DeSQ (Decomposition-based SPARQL Query Generation), a KB-agnostic framework that operates in three stages. First, it decomposes complex questions into Atomic Constraints (ACs) that mirror the relational structure of the underlying KB. Second, it generates a two-part structured output: (a) Mapping of each AC to its corresponding SPARQL Fragment, using standardized variable and URIs placeholders, and (b) URIs Grounding block describing each placeholder. Third, it assembles these fragments into a complete SPARQL query. DeSQ surpasses state-of-the-art approaches on four out of five major benchmarks and demonstrates superior robustness to lexical variation. Beyond performance gains, our framework greatly simplifies evaluation by eliminating the need for a live KB endpoint, and its structured output enables fine-grained error analysis, allowing more targeted interventions for improvement.
[NLP-223] BAGEN: Are LLM Agents Budget-Aware?
【速读】: 该论文旨在解决当前智能体(Agent)在执行过程中对资源消耗缺乏前瞻性管理的问题,即现有评估体系仅在任务完成后才衡量成本,而未将预算作为动态控制信号进行实时调控。其核心挑战在于如何使智能体具备预算感知能力(Budget-Awareness),以实现对剩余资源的主动预测与风险预警。解决方案的关键在于提出“渐进式区间估计”(progressive interval estimation)框架:在每一步计划中,智能体需预测剩余预算的上下界,并在任务完成可能性较低时及时发出警报。实验通过滚动回放协议(rollout-replay protocol)验证了该机制的有效性,发现主流前沿模型普遍存在预算高估问题,且预算感知信号具有可训练性和可操作性——通过监督微调(SFT)与强化学习(RL)联合优化,可使失败轨迹减少28%-64%的令牌消耗;然而,尽管模型性能提升,预算区间的校准精度仍受限,覆盖率达上限仅为47%,表明精确区间建模仍是待突破的关键难点。
链接: https://arxiv.org/abs/2606.00198
作者: Yuxiang Lin,Zihan Wang,Mengyang Liu,Yuxuan Shan,Longju Bai,Junyao Zhang,Xing Jin,Boshan Chen,Jinyan Su,Xingyao Wang,Jiaxin Pei,Manling Li
机构: Northwestern University (西北大学); O2 Lab; Independent (独立); University of Michigan (密歇根大学); Cornell (康奈尔大学); All Hands AI (全部手人工智能); Stanford (斯坦福大学); UT Austin (德克萨斯大学奥斯汀分校)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) frontier models are consistently over-optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget-aware signal is actionable and trainable. Early stop saves 28-64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: this https URL
[NLP-224] BOUTEF: A Multilingual Corpus for FakeNews in North Africa – Language as a Weapon
【速读】: 该论文旨在解决多语言、低资源语境下虚假新闻(fake news)在北非地区传播的监测与分析难题,尤其聚焦阿尔及利亚和突尼斯两国的虚假信息生态。其核心挑战在于缺乏覆盖多种语言变体(包括标准阿拉伯语、阿尔及利亚与突尼斯方言、阿拉伯化拉丁字母拼写法(Arabizi)、法语、英语及混合语码)的大规模、标注精准的多语言虚假新闻数据集。为此,研究提出构建BOUTEF——一个大规模多语言语料库,整合虚假叙事、真实叙事及其关联用户评论与经验证的辟谣信息,形成多维度、可追溯的数据基础。解决方案的关键在于通过构建这一兼具语言多样性与内容真实性标注的公开语料库,支持对虚假新闻的主题分布、语言与修辞策略、情感模式及社交互动机制的定量与定性结合的实证分析。研究发现,虚假新闻高度依赖情绪化叙事、夸张化框架与混合语言实践以增强传播力;而辟谣内容则呈现更注重事实核查的风格。此外,跨国家比较揭示了共性传播规律与受社会政治背景影响的国别差异,凸显了非正式语言使用在信息失序中的关键作用。该工作为虚假新闻检测、低资源语言处理及复杂语言环境中信息失序现象的研究提供了重要数据支撑与方法论基础。
链接: https://arxiv.org/abs/2606.00193
作者: Kamel Smaili,Yassine Toughrai,Amina Laggoun,David Langlois
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:The rapid spread of fake news on social media has become a major challenge, particularly in multilingual and under-resourced contexts such as North Africa. In this paper, we introduce BOUTEF, a large-scale multilingual corpus designed to study the propagation, characteristics, and impact of fake news in Algeria and Tunisia. The corpus integrates three complementary components: fake narratives, genuine narratives, and associated user-generated comments, along with verified debunking information. It covers a wide range of languages and linguistic varieties, including MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switched language. Building on this resource, we conduct a comprehensive empirical analysis combining quantitative and qualitative approaches. We examine thematic distributions, linguistic and rhetorical strategies, sentiment patterns, and social engagement dynamics. Statistical analyses reveal significant associations between thematic categories and message veracity, as well as strong correlations between user engagement and the visibility of fake content. Our findings show that fake news relies heavily on emotionally charged narratives, sensational framing, and hybrid linguistic practices that enhance virality and audience engagement. In contrast, debunking content adopts a more factual and verification-oriented style. Furthermore, a comparative analysis between Algeria and Tunisia highlights both shared dynamics and country-specific characteristics shaped by sociopolitical contexts. The results emphasize the role of informal language practices in the diffusion and reception of misinformation. By providing a rich, annotated, and publicly available dataset, this work contributes to advancing research on fake news detection, low-resource language processing, and the understanding of information disorders in complex linguistic environments.
[NLP-225] RealityTest: How People Probe AI Identity and Whether Models Disclose It
【速读】: 该论文旨在解决生成式 AI(Generative AI)在对话场景中缺乏透明度的问题,即用户在与AI交互时难以判断对方是人类还是机器,而现有评估体系多局限于英文语境、依赖机器生成的问题且仅限于文本模态,无法真实反映实际使用中的披露行为。其解决方案的关键在于提出首个大规模多模态、多语言的现实基准测试框架RealityTest,基于来自49个国家、五种语言、约750名参与者的真实人类数据构建了包含3,152条身份探询问题的数据集(涵盖文本与语音场景),以更贴近真实世界中人们对AI身份的质疑方式。研究发现,仅有31%的人在模糊情境下会直接询问身份,且人类提问形式远比机器生成的问题多样化;在对17个文本模型和6个语音模型的测试中,披露行为存在显著差异,但即使在表现最优的模型中,仅通过一条抑制指令即可将披露率降至30%以下。研究进一步表明,问题的表述方式与对话上下文对披露行为的影响,甚至超过模型本身的选择,强调了基于多样化、真实人类数据构建安全评估体系的重要性,指出依赖狭窄或合成查询集的安全评估可能严重误判模型在真实部署环境中的表现。
链接: https://arxiv.org/abs/2606.00168
作者: Anna Gausen,Sarenne Wallbridge,Bessie O’Dell,Christopher Summerfield,Hannah Rose Kirk
机构: AI Security Institute(人工智能安全研究所)
类目: Computation and Language (cs.CL)
备注: 9 pages, 4 figures
Abstract:AI systems are increasingly deployed in conversational settings where users may be uncertain whether they are speaking with a human or an AI. Despite mounting regulatory attention to this known safety risk, existing evaluations of AI disclosure are typically English-only, based on machine-generated questions, and restricted to text. We present RealityTest to comprehensively test whether AI systems disclose their identity when asked. The benchmark is the first large-scale multimodal and multilingual evaluation, grounded in human data on how people actually encounter and question AI identity in the real-world. Alongside the benchmark, we release the underlying dataset of 3,152 identity-probing queries collected from ~750 participants across 49 countries and five languages, in text and speech scenarios. We find that only 31% of people ask about identity directly in ambiguous scenarios, and that the questions people ask are far more diverse than machine-generated queries. We test 17 text and 6 speech models, and find substantial variation in disclosure behaviour. However, a single suppression instruction reduces disclosure rates to below 30%, even in the best-performing models. Validating our investment in diverse, human-grounded evaluation data, we find that how the question is phrased and the context of the conversation matter more for disclosure than which model is being tested. Safety evaluations built on narrow or synthetic query sets risk mischaracterising how models behave in realistic deployment settings.
[NLP-226] DataShield: Safety-degrading Data Filtering for LLM Benign Instruction Fine-Tuning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在使用良性数据集进行微调后仍可能出现安全能力退化的问题。现有方法在识别良性数据集中潜在的安全退化样本时,面临计算成本高和噪声干扰严重等挑战。本文提出的DataShield解决方案的关键在于:基于良性微调会提升模型整体响应合规性的观察,将每个样本对模型合规行为的贡献度量化为“安全退化评分”(safety degradation score),从而高效识别高风险样本。其核心技术包括:(1)合规向量提取(Compliance Vector Extraction),用于捕捉模型的合规行为倾向;(2)一种新型的合规感知评分(Compliance-Aware Score, CAS),可自动确定最敏感的安全关键层;(3)安全退化样本过滤机制,通过衡量训练数据在合规方向上的投影偏移来评估风险。实验在Llama3-8B、Llama3.1-8B和Qwen2.5-7B上使用Alpaca与Dolly数据集验证了该方法的有效性,结果表明开放式问答任务更易引发安全退化,且相关回复长度普遍更长。该工作为数据驱动的安全防御提供了新视角。
链接: https://arxiv.org/abs/2606.00160
作者: Junbo Zhang,Qianli Zhou,Xinyang Deng,Wen Jiang,Jie Pan,Jinbiao Zhu
机构: Northwestern Polytechnical University (西北工业大学)
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) suffer from degraded safety capabilities even when fine-tuned with benign datasets. However, existing methods for identifying safety-degrading samples in benign datasets suffer from high computational costs and significant noise issues. In this paper, we propose DataShield to efficiently and effectively identify potential safety-degrading samples. Our key intuition is based on the observation that benign fine-tuning increases the overall response compliance of LLMs. DataShield’s key technical insight is to quantify each sample’s contribution to the model’s compliance behavior as its safety degradation score. DataShield consists of three core components: (1) Compliance Vector Extraction, which captures the LLM’s compliance behavior tendency; (2) a novel Compliance-Aware Score (CAS), which automatically identifies the optimal safety-critical layer; and (3) Safety-degrading Sample Filtering, which quantifies the projection shift of training data along the compliance direction. Extensive experimental evaluation on Llama3-8B, Llama3.1-8B, and Qwen2.5-7B using the Alpaca and Dolly benign datasets validates our method’s effectiveness in identifying high-risk and low-risk data subsets. We also observe that open-ended question answering is more likely to trigger safety degradation, and corresponding responses tend to be longer. We hope this work can provide new insights into data-centric defense methods. The source code is available at: this https URL.
[NLP-227] Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey
【速读】: 该论文旨在解决生成式人工智能(Generative AI, GenAI)加速传播对抗性合成内容所导致的传统被动检测方法失效的问题。其核心挑战在于如何在虚假叙事大规模扩散前实现早期识别与干预,以应对日益复杂的网络信息操纵威胁。解决方案的关键在于提出一种基于生命周期的统一分类框架——C5交互模型(Context, Causes, Content, Cycle of Amplification, Consequences),将社会技术系统中对抗性传播活动的演化过程与先进的计算检测方法深度融合。通过该模型,论文系统整合了机器学习与社会科学的研究成果,重点发展针对合成内容生成、播种及传播阶段的主动检测技术,包括协调性不真实行为(Coordinated Inauthentic Behavior, CIB)分析、流行病学建模以及霍克斯过程(Hawkes Process)等动态传播建模方法。同时,针对高维嵌入空间中的异常检测、多层图上的无监督协同模式识别,以及基于智能体的生成式AI系统,构建了多层次、前瞻性的防御机制。最终,论文揭示了生成式人工智能带来的威胁快速演化与多层级分布漂移等挑战,并提出未来研究应聚焦于异常簇的识别与可预见、强韧的信息生态系统的建设,从而推动从被动响应向主动预防的根本范式转变。
链接: https://arxiv.org/abs/2606.00136
作者: Jonghyun Chung,Rishabh Chaddha,Sanket Badhe,Debanshu Das,Nathan Huang,Amanpreet Kaur
机构: Google LLC(谷歌)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Social and Information Networks (cs.SI)
备注: 14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)
Abstract:The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes process. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 interaction model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.
[NLP-228] Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization ACL2026
【速读】: 该论文旨在解决低资源多语言环境下法律文档分类与摘要生成中的关键挑战,包括领域语言特性、多语言混杂性、上下文长依赖关系以及类别不平衡等问题。其核心解决方案在于提出一种基于Kolmogorov-Arnold Network(KAN)的双向门控循环单元(BiGRU)架构,通过引入可学习的非线性映射能力更强的KAN模块,有效捕捉法律文本中的复杂语义结构。在分类任务中,KAN模块显著提升了模型性能,使准确率从基线57.34%提升至67.96%;在摘要生成部分,则采用注意力机制增强的GRU结合KAN作为输出头,实现了对多语言法律文档的有效压缩与关键信息提取。实验结果表明,该方法在多个评价指标上优于传统机器学习算法及预训练语言模型,验证了KAN在处理低资源多语言法律文本任务中的优越性。
链接: https://arxiv.org/abs/2606.00116
作者: Ahmed Faizul Haque Dhrubo,Souvik Pramanik,Most. Aysha Siddika Sumona,Shahnewaz Siddique,Mohammad Ashrafuzzaman Khan,Mohammad Abdul Qayum,Mohsin Sajjad
机构: North South University (北方南大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper contains of 10 pages, 10 figures, 4 tables and version 2 after it review from ACL 2026
Abstract:This study introduces a novel architecture of KAN-based BiGRU model for the task of classification and summarization of legal documents in a low-resource multilingual setup. In order to tackle problems associated with domain language, the usage of different languages, long dependencies within context, and class imbalance, we employ the dataset composed of legal documents from Bangladesh and taken from Manupatra, which include Bengali, English, and transliterated Bengali languages. Our classification task involves BiGRU model, along with Kolmogorov-Arnold Network (KAN) module, while the summarization part utilizes attention-based GRU, combined with a KAN model head. Classification model yields 67.96% of accuracy and 0.65 F1 score; while ROUGE-1, ROUGE-2, and ROUGE-L measures for summarization yield 0.38, 0.23, and 0.31 F1 scores, correspondingly. Ablation study shows that the use of KAN increases classification accuracy from 57.34% to 67.96%. Moreover, our proposed technique is compared to several baselines, including classical ML algorithms and pretrained language models.
[NLP-229] Bridging the 2D-3D Gap: A Hierarchical Semantic-Geometric Map for Vision Language Navigation
【速读】: 该论文旨在解决视觉-语言导航(Vision-Language Navigation, VLN)任务中存在的重要语义-几何鸿沟问题:尽管视觉-语言模型(VLMs)在2D视觉理解和自然语言理解方面表现优异,但在3D空间推理能力以及动作与空间变换之间因果关系建模方面存在明显不足,导致在零样本(zero-shot)场景下导航结果不可靠。其解决方案的关键在于提出一种分层语义-几何地图(Hierarchical Semantic-Geometric Map, HSGM),将三维几何信息转化为与VLM兼容的结构化表示,从而实现模型与物理世界的有效对齐。HSGM采用三级层次结构:几何层记录可通行区域与障碍物,语义层表征物体及其空间关系,决策层支持高层任务推理与目标选择。在导航过程中,VLM作为高层语义规划器,基于HSGM编码的空间布局生成几何上有效的路径点,而路径点间的低层无碰撞移动则由经典路径规划算法独立执行,实现了语义推理与动作执行的完全解耦。此外,通过将复杂指令分解为子任务,有效缓解了长时程导航中的进展遗忘或幻觉问题。在R2R-CE和RxR-CE基准上的大量实验表明,所提零样本框架达到当前最优性能,甚至超越部分监督方法。
链接: https://arxiv.org/abs/2606.00095
作者: Kailing Li,Tianwen Qian,Lijin Yang,Yuqian Fu,Jingyu Gong,Xiaoling Wang,Liang He
机构: East China Normal University (华东师范大学); Bosch Corporate Research (博世企业研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:
Abstract:Vision-Language Navigation (VLN) enables embodied agents to reach target locations in unseen environments by following language instructions. Despite recent progress with vision-language models (VLMs), a critical semantic-geometric gap remains: while VLMs excel at language and 2D visual understanding, they struggle with 3D spatial reasoning and fail to capture the causal dynamics between actions and spatial transitions, resulting in unreliable navigation, particularly in zero-shot settings. To bridge this gap, we propose a Hierarchical Semantic-Geometric Map (HSGM) that transforms 3D geometric information into a structured representation compatible with VLMs, effectively linking them to the physical world. Specifically, HSGM is represented as a multi-channel top-down map organized into three levels: (1) geometric level that records navigable regions and obstacles, (2) semantic level that represents objects and their relations, and (3) decision level that supports high-level task reasoning and goal selection. During navigation, the VLM acts as a high-level semantic planner, interpreting the spatial layout encoded in the HSGM to select geometrically valid waypoints, while low-level, collision-free movements between waypoints are executed by a classical path-planning algorithm, fully decoupling semantic reasoning from action execution. Additionally, complex instructions are decomposed into subtasks to alleviate the problem of progress forgetting or hallucinating in long-horizon navigation. Extensive experiments on R2R-CE and RxR-CE benchmarks demonstrate that our zero-shot framework achieves state-of-the-art performance and even outperforms several supervised methods. Code is available at this https URL.
[NLP-230] DLLM -JEPA: Joint Embedding Predictive Architectures for Masked Diffusion Language Models ICML2026
【速读】: 该论文旨在解决自监督视觉表征学习中引入的生成式语言模型(Generative AI)架构在迁移至自然语言处理任务时所面临的双重高成本问题:一是需要显式的多视图数据对(如文本-代码对),二是每步训练需两次携带梯度的前向传播,导致计算开销巨大。其解决方案的关键在于提出DLLM-JEPA框架,通过将生成式自编码器(JEPA)与掩码扩散语言模型(Masked-Diffusion Language Model, DLLM)相结合,利用扩散模型双向注意力机制在不同掩码率下天然生成语义上区分的双视图输入,从而无需依赖外部显式多视图数据对;同时,该设计仅需一次携带梯度的前向传播,相较于LLM-JEPA降低了33%的训练浮点运算量(FLOPs)。实验表明,DLLM-JEPA在多个任务和模型架构组合中均显著优于纯扩散模型微调,最大提升达+18.7个百分点(LLaDA-8B GSM8K)和+11.4个百分点(Dream-7B GSM8K),并在Spider、NL-RX-SYNTH和Django等任务上保持一致正向增益。此外,该方法展现出“双优”特性:在不牺牲通用知识能力(如MMLU准确率)的前提下,同时提升下游任务性能并降低未见数据(如Wikitext)的损失,层间探查揭示其内在机制为“几何-功能漂移解耦”现象——微调后的骨干网络虽远离预训练权重,但对未见数据的遗忘程度更低,且该效应集中于中间Transformer层,具有跨模型泛化性。
链接: https://arxiv.org/abs/2606.00091
作者: Sangdae Nam
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 4 figures, 13 tables. Accepted at SPIGM Workshop, ICML 2026
Abstract:Joint Embedding Predictive Architectures (JEPAs) have reshaped self-supervised representation learning in vision. The recent LLM-JEPA ported JEPA to autoregressive language models but inherited two steep costs from the causal-attention substrate: it demands explicit multi-view data (e.g., text-code pairs), and it requires two gradient-carrying forward passes per step. We introduce DLLM-JEPA, which pairs JEPA with masked-diffusion language models to eliminate both costs at once. The bidirectional attention of diffusion models yields two semantically distinct views of the same input via different masking rates – no explicit pairs needed – and supports a single gradient-carrying forward pass, cutting training FLOPs by 33% relative to LLM-JEPA. DLLM-JEPA improves over diffusion-only fine-tuning in every (task, architecture) combination we evaluate: up to +18.7 pp on LLaDA-8B GSM8K and +11.4 pp on Dream-7B GSM8K, with consistent positive gains on Spider, NL-RX-SYNTH, and Django. Beyond accuracy, DLLM-JEPA exhibits a dual-win property: on LLaDA-8B with the Wide-t configuration, it simultaneously raises GSM8K accuracy (67.1 vs. 65.2, +1.8 pp), drives held-out Wikitext loss below the pre-trained base, and preserves MMLU accuracy at base level across three fine-tuning seeds – whereas an L2-to-base parameter anchor matches baseline accuracy with no task gain. Layer-wise probing reveals the mechanism: a geometric-functional drift dissociation in which the fine-tuned backbone moves further from the pre-trained weights than the baseline yet forgets less on held-out Wikitext, with the amplification concentrated in middle transformer layers. The pattern appears on Dream-7B as well, indicating the phenomenon is not specific to a single backbone.
[NLP-231] Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study
【速读】: 该论文旨在解决传统基于向量的检索增强生成(Retrieval-Augmented Generation, RAG)系统在金融领域多实体关系分析中的局限性,即难以有效捕捉金融市场分析中至关重要的结构化、多实体间复杂关联。其核心解决方案是提出一种两跳图结构RAG(Two-hop Graph-RAG)架构,通过构建情感加权的知识图谱,将59个股票实体与255篇科技类新闻文章关联,并引入基于“影响”(INFLUENCES)边的强度过滤图遍历机制,以挖掘向量检索无法获取的关系性证据。关键创新在于将密集检索与受控图遍历相结合,在保持答案质量基本不变的前提下,显著提升了多实体关系型查询的实体召回率(+6.4%,p < 0.001)和答案相关性(+11.7%),尤其在关系型问题上提升达+16.1%。同时,研究揭示了图遍历强度阈值存在倒U型优化关系,确定τ = 0.5为最优参数,为实际应用提供了可操作的架构指导。
链接: https://arxiv.org/abs/2606.00062
作者: Rajan Bastakoti,Sagar Bhetwal,Nirajan Acharya,Gaurav Kumar Gupta
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Retrieval-Augmented Generation (RAG) has become foundational for grounding large language models in domain-specific corpora, yet conventional vector-based RAG systems are fundamentally limited in their ability to capture the structured, multi-entity relationships that underpin financial market analysis. This paper presents a comprehensive comparative study of a novel two-hop Graph-RAG architecture versus a standard vector-only baseline for cross-entity financial sentiment analysis. Our system constructs a sentiment-weighted knowledge graph of 59 equity entities from 255 news articles covering 10 major technology stocks, then augments dense retrieval with intensity-filtered graph traversal over INFLUENCES edges to surface relational evidence inaccessible to vector search alone. We evaluate both architectures on 100 grounded queries (30 Direct, 70 Relational) using semantic similarity, entity recall, RAGAS metrics, latency benchmarks, and ablation studies. Graph-RAG achieves a statistically significant improvement in entity recall (+6.4%, p 0.001, Wilcoxon signed-rank) and delivers substantially more relevant answers for complex multi-entity queries (+11.7% Answer Relevancy), with gains concentrating in relational question types (+16.1%). Critically, these improvements come at no measurable cost to answer quality (delta = +0.001 semantic similarity, Cohen’s d = 0.078), with a modest 22.6% increase in mean latency offset by an 80% reduction in latency variance. An ablation study on the graph traversal intensity threshold reveals an inverted-U relationship with answer quality, identifying tau = 0.5 as optimal over the production default of tau = 0.7. These findings characterize a precision-for-coverage trade-off inherent to graph-augmented retrieval and provide actionable architectural guidance for practitioners building RAG systems for multi-entity financial analysis. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2606.00062 [cs.CL] (or arXiv:2606.00062v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.00062 Focus to learn more arXiv-issued DOI via DataCite
[NLP-232] he Invisible Coalition Partner: How LLM s Vote When Democracy Gets Concrete
【速读】: 该论文旨在解决现有研究中关于指令微调的大规模语言模型(Large Language Models, LLMs)存在左倾政治偏见的结论是否适用于真实政策决策场景的问题。此前的研究仅基于抽象政治问卷测量模型偏见,而未能验证其在具体政策投票情境下的适用性。本文的关键解决方案在于提出一种双工具(dual-instrument)方法,结合瑞士直接民主制度中的真实联邦公投(Volksabstimmungen)数据,构建了一个涵盖四种官方语言(德语、法语、意大利语、罗曼什语)的实证框架。研究通过对比9个主流大模型在三种信息条件下的投票行为与实际公投结果及政党建议(Parolen),发现:(1)抽象问卷中表现出的左倾倾向在具体政策决策中转变为以中间派(Die Mitte、FDP)为中心的分布,表明先前的“左倾偏见”无法外推至现实决策;(2)部分模型的回答受语言表述影响程度甚至超过政治内容本身,跨语言一致性差异显著(50%–98%),揭示语言敏感性对模型输出的重要干扰;(3)两个模型呈现系统性反变革倾向,在所有公投中均倾向于否决(Nein),无论议题方向如何,其行为更符合谨慎官僚而非意识形态盟友。因此,论文指出,以往通过抽象问卷所识别的“左倾偏见”可能并不反映模型在真实政策情境中的行为本质,反而揭示出模型在具体决策中表现出中心化、现状偏好和语言依赖等新特征。
链接: https://arxiv.org/abs/2606.00048
作者: Joel Barmettler
机构: Independent Researcher(独立研究员); Zurich(苏黎世); Switzerland(瑞士)
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 13 pages, 10 figures. Preprint. Code and data: this https URL
Abstract:Prior research has established that instruction-tuned large language models exhibit left-of-center political bias, measured exclusively through abstract political questionnaires. We show that this finding does not generalize to concrete policy decisions. We introduce a dual-instrument methodology grounded in Swiss democratic reality. The Smartvote questionnaire (75 abstract policy questions) is administered to 66 LLMs from 27 model families and compared to 184 elected members of the Swiss National Council, replicating the established leftward convergence (Cohen’s d = 3.64, p = 0.0002). Then, novel to this work, 9 flagship LLMs are confronted with 48 real federal referenda (Volksabstimmungen) in four national languages (German, French, Italian, Romansh) under three information conditions, comparing votes to actual outcomes and party recommendations (Parolen). Three findings challenge the prevailing narrative. (1) Abstract questionnaires do not predict concrete behavior: the left-to-right agreement gradient on Smartvote shifts from left-peaked to center-peaked on Volksabstimmungen, where models align most with centrist Die Mitte and FDP rather than leftist SP and Gruene (Wilcoxon p = 0.008). (2) For some models, the language of a political question changes the answer more than the political content does: cross-linguistic consistency ranges from 50% (Mistral) to 98% (GPT-5.4). (3) Two models exhibit systematic change-aversion rather than political bias, voting Nein on 83-94% of referenda regardless of direction (binomial p 0.0001). What prior work measured as “leftward bias” may not generalize beyond abstract instruments. On concrete policy decisions, LLMs behave less like coalition partners of the left and more like cautious civil servants: centrist, status-quo-favoring, and inconsistent across languages.
[NLP-233] LLM s for Cardiovascular Risk Prediction from Structured Clinical Data
【速读】: 该论文旨在解决冠状动脉疾病(CAD)早期预测中如何有效融合结构化临床数据与自然语言医学信息的问题,以提升预测系统的准确性与临床实用性。其核心挑战在于传统机器学习模型虽在结构化数据上表现良好,但难以直接处理医生记录中的非结构化文本信息,而大语言模型(LLM)虽具备理解自然语言的能力,却面临数据隐私与可解释性问题。解决方案的关键在于构建一个混合框架:将1,190例患者记录中的11个结构化临床变量通过大语言模型(LLM)转化为可解释的特征表示及合成临床叙事文本,并通过反向提取验证机制评估生成内容与原始数据的一致性,实现平均94.61%的保真度。在此基础上,对比四种传统机器学习模型与基于GPT和Gemini的零样本及少样本提示分类性能,发现随机森林(Random Forest)在准确率上最优,但LLM-based分类在真实临床场景中仍具优势——因其可直接处理自然语言描述,无需暴露敏感数值型数据(如实验室值、血压读数、诊断编码),从而在保障数据隐私的前提下实现高效、可扩展的辅助诊断支持。研究结果表明,结合结构化数据与LLM生成的叙事文本,为构建新一代隐私友好型、可解释的混合临床预测系统提供了可行路径。
链接: https://arxiv.org/abs/2606.00031
作者: Jeba Maliha,Md Rafiul Kabir
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: International Conference on Intelligent Systems, Blockchain, and Communication Technologies
Abstract:Coronary artery disease (CAD) remains one of the leading causes of death globally, highlighting the need for reliable predictive systems to support early diagnosis and risk assessment. While traditional machine learning models perform well on structured clinical data, large language models (LLMs) present new possibilities to interpret medical information expressed in natural language. In this work, we develop a hybrid framework that bridges structured clinical data and natural-language representations for CAD prediction. Using a publicly available dataset of 1,190 patient records with 11 clinical attributes, structured variables are converted into interpretable feature representations and synthetic clinical narratives using LLMs. A validation pipeline performs reverse extraction of clinical variables and computes a consistency score with the original records, achieving an average fidelity of 94.61%. We then evaluate four conventional machine learning models and compare their performance with LLM-based classification under zero-shot and few-shot prompting settings. We use two LLMs here, GPT and Gemini. Experimental results show that Random Forest achieves the highest accuracy. Despite this advantage, LLM-based classification remains beneficial in real-world clinical settings. This is because LLMs operate directly on natural language patient descriptions, meaning that sensitive numerical patient data such as exact lab values, blood pressure readings, and diagnostic codes are kept private. Findings suggest that combining structured clinical data with LLM-generated narratives can enable new directions for hybrid clinical prediction systems.
[NLP-234] CAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation
【速读】: 该论文旨在解决检索增强生成系统在处理历史刑事案件叙述中的复杂问题时,因缺乏时间推理能力与证据融合机制而导致的答不准、逻辑不连贯的问题。现有方法或未能根据查询语义动态调整检索策略,或无法有效整合多源证据形成一致推理链。其解决方案的关键在于提出一种名为“时间上下文增强的检索生成”(Temporal Context Augmented Retrieval Generation, TCAR-Gen)的框架,该框架通过查询条件化的图神经网络实现语义感知的检索,引入时间证据融合机制以建模事件的时间演进关系,并采用树状链式推理结构支持多分支证据的协同分析。实验结果表明,TCAR-Gen在维多利亚犯罪日记(Victorian Crime Diaries)基准上取得了0.3738的Recall@5,显著优于基线模型,且消融实验证明上下文图结构、时间惩罚机制与查询条件化是核心组件。跨模型评估进一步揭示,尽管小规模语言模型在生成质量上表现下降,但TCAR-Gen仍能保持较强的检索覆盖能力,验证了显式时间建模与多路径证据融合对于知识驱动型复杂问答任务的重要性。
链接: https://arxiv.org/abs/2606.00029
作者: Sidra Nasir,Muhammad Noman Zahid,Rizwan Ahmed Khan
机构: University of Verona (维罗纳大学); University of Camerino (卡梅里诺大学); Institute of Business Administration (IBA), Karachi (卡拉奇工商学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Retrieval-augmented generation systems struggle with temporal reasoning and evidence fusion when answering complex questions over historical criminal case narratives. Existing approaches either retrieve independently of query semantics or fail to integrate multiple evidence sources coherently. We propose Temporal Context Augmented Retrieval Generation (TCAR-Gen), a framework that combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning to ground answer generation in retrieved evidence. On the Victorian Crime Diaries benchmark, TCAR-Gen achieves 0.3738 Recall@5, outperforming Vanilla RAG, Temporal RAG, GraphRAG-C, and GraphRAG-T across seven query types including multi-hop reasoning and counterfactual questions. Ablation studies reveal that the context graph, temporal penalty mechanism, and query conditioning are critical components. Cross-model evaluation across five language model (GPT-OSS 20B to TinyLlama 1.1B) demonstrates that TCAR-Gen maintains robust retrieval coverage at smaller model scales, though generation quality degrades substantially with reduced model capacity. Our work shows that explicit temporal modelling and multi-branch evidence fusion are essential for faithful, reasoning-intensive question answering over knowledge-grounded corpora.
[NLP-235] A Multi-Domain Red Teaming Framework for Safety Robustness and Fairness Evaluation of Medical Large Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在医疗领域部署中,现有评估基准无法有效捕捉模型在对抗性或伦理复杂情境下的行为表现这一关键问题。其核心挑战在于,传统以平均准确率为单一指标的评估方法难以揭示模型在临床实践中可能发生的严重安全风险,尤其在面对真实世界中常见的边缘案例或敏感情境时。论文提出了一种多领域红队测试框架(multi-domain red teaming framework),通过构建690个基于临床实际场景的测试用例,覆盖九个医学领域及超过150个子类别,引入对抗性变换以模拟真实临床威胁,并采用包含七维度评分标准的混合评估体系,结合生成式AI辅助打分与临床专家人工验证。研究发现,尽管部分高性能模型(如X-BAI、GPT-5、Claude Opus 4.1)整体表现优异(平均得分高于0.97),但其在个别关键安全场景中仍出现完全失效的情况,表明平均性能指标会掩盖潜在的重大临床风险。此外,涉及公平性任务时,人口统计学特征的微小变化可导致错误率上升10%-20%。研究强调,仅依赖均值准确率不足以评估模型的临床可靠性,而应关注性能方差和最坏情况下的失败模式;同时,融合自动化评估与临床医生介入的混合评估范式是实现可信安全评估的关键所在。
链接: https://arxiv.org/abs/2606.00027
作者: Andrei Marian Feier,Veysel Kocaman,Yigit Gul,Ahmet Korkmaz,Alexander Thomas,Aleksei Zakharov,Jay Gil,Mehmet Butgul,David Talby
机构: John Snow Labs Inc.
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures. To be presented at the Text2Story 2026 Workshop (Delft, The Netherlands, 29 March 2026); CEUR Workshop Proceedings (forthcoming). Affiliation: John Snow Labs Inc
Abstract:Large language models (LLMs) are increasingly deployed across healthcare, yet existing benchmarks fail to capture model behavior under adversarial or ethically complex conditions common in clinical practice. We developed a multi-domain red teaming framework evaluating eleven contemporary LLMs across 690 clinically grounded scenarios spanning nine domains and over 150 subcategories. Scenarios incorporated adversarial transformations, and responses were assessed using a seven-dimension rubric with LLM-assisted scoring and human-in-the-loop validation. Results revealed substantial performance variance, with mean scores ranging from 0.791 to 0.984. Critically, several high-performing systems produced complete failures in individual safety-critical scenarios, demonstrating that aggregate accuracy masks clinically meaningful risk. The highest-performing systems (X-BAI, GPT-5, Claude Opus 4.1) achieved scores above 0.97 with low variance, while performance varied significantly across domains. Equity-related tasks showed 10-20% error amplification with demographic modifications, and human reviewers identified clinically relevant failures missed by automated evaluation. Our findings demonstrate that performance variance and worst-case failures provide more clinically meaningful reliability indicators than mean accuracy alone, and that hybrid evaluation approaches combining automation with clinician oversight are essential for credible safety assessment.
[NLP-236] Cognitive-Linguistic Indicators of Depression in Online Communities: Analysed by DistilBERT and Holographic Reduced Representation
【速读】: 该论文旨在解决在线文本中抑郁情绪自动检测的准确性问题,尤其关注如何有效融合认知心理学理论指导的语言特征与深度学习嵌入表示以提升分类性能。其核心挑战在于传统自然语言处理方法难以捕捉抑郁个体在语言中体现的深层认知模式。解决方案的关键在于将贝克抑郁认知理论(Beck’s Cognitive Theory of Depression)所定义的认知扭曲(cognitive distortions)转化为可量化的语言特征,包括第一人称代词密度、绝对化词汇使用频率及负面情绪表达强度,并将其与基于DistilBERT的上下文语义嵌入进行融合。具体而言,采用全息还原表示(Holographic Reduced Representation, HRR)对这些认知-语言特征进行编码,再与DistilBERT句向量拼接,最终通过逻辑回归实现分类。实验结果表明,该混合模型在多个评估指标上显著优于传统的TF-IDF+朴素贝叶斯基线模型,宏平均F1得分从0.80提升至0.94,5折交叉验证F1由0.83增至0.92,AUC从0.958提高到0.981,验证了结合认知理论驱动的语言特征与预训练模型嵌入的有效性。
链接: https://arxiv.org/abs/2606.00026
作者: Brian Van Steen
机构: University of Leeds (利兹大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper investigates whether combining cognitively grounded linguistic features with transformer-based embeddings improves automated detection of depression in online text. Using Beck’s Cognitive Theory of Depression, the study extracts cognitive distortions as measurable features, including first-person pronoun density, absolutist words, and negative emotion in Reddit posts from depression-related and control communities. Using a subset of the Kaggle Reddit Suicide and Depression Detection dataset, two classification pipelines are compared, a TF-IDF embedding with Naive Bayes as a baseline, and a hybrid model that concatenates DistilBERT sentence embeddings with Holographic Reduced Representation (HRR) vectors encoding the cognitive-linguistic features, followed by Logistic Regression. The hybrid DistilBERT HRR model achieves a macro F1 score of 0.94 versus 0.80 for the TD-IDF baseline, with 5-fold cross validation F1 improving from 0.83 to 0.92, and AUC from 0.958 to 0.981.
[NLP-237] ART: Attention Run-time Termination for Efficient Large Language Model Decoding
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长文本生成过程中因需频繁访问庞大的键值(Key-Value, KV)缓存而导致的内存带宽瓶颈问题。现有大多数KV管理方法依赖于仅基于键(key-only pruning)的剪枝策略,但忽略了值(value)在注意力计算中的协同作用;尽管引入值信息可提升注意力精度,却会带来难以承受的额外开销。为此,本文提出一种轻量级运行时机制——注意力运行时终止(Attention Run-time Termination, ART),其核心在于在核函数执行过程中实时追踪累积的注意力输出,并在检测到后续KV块访问对结果贡献趋于可忽略时提前终止访问。该设计与现有的基于键的KV缓存管理方法正交,可无缝集成以增强性能。在LongBench基准测试中,ART在大批量场景下实现了比当前最优基线20%更高的生成吞吐量,同时保持相近的生成准确率。
链接: https://arxiv.org/abs/2606.00024
作者: Chen Qiu,Guozhong Li,Panos Kalnis
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Long-context decoding in Large Language Models (LLMs) is severely constrained by the memory bandwidth required to fetch the extensive Key-Value (KV) cache. Most existing KV management methods rely on key-only pruning before decoding, despite the evidence that attention outputs depend jointly on keys and values, as incorporating values in their methods incurs prohibitive additional overhead. In this paper, we propose Attention Run-time Termination (ART), a lightweight run-time mechanism that tracks accumulated attention outputs during kernel execution and terminates subsequent KV block accesses once further contributions become negligible. This design makes ART orthogonal to existing key-based KV cache management methods, enabling seamless integration with them. Experiments on LongBench benchmarks show that ART achieves 20% higher generation throughput in large batch size than state-of-the-art baseline while maintaining comparable accuracy.
[NLP-238] rustLDM: Benchmarking Trustworthiness in Language Diffusion Models
【速读】: 该论文旨在解决生成式语言扩散模型(Language Diffusion Models, LDMs)在实际应用中面临的可信性(trustworthiness)风险问题,尤其关注其在存在恶意后置上下文(post contexts)时的安全性、隐私性和公平性退化现象。尽管LDMs在仅依赖用户提示(prompt)时表现出较强的对齐能力,但当攻击性或误导性后置上下文被附加至掩码响应中时,其可信性显著下降。论文提出的关键解决方案是构建一个面向LDMs的综合性可信性评估基准TrustLDM,系统评估多种架构在不同静态后置上下文下的表现,并进一步设计TrustLDM-Auto——一种利用LDM解码灵活性的自动化评估框架,通过系统性探索解码顺序与生成长度等配置变量,精准识别出各类模型在多个维度上的可信性薄弱环节。该方法揭示了当前主流LDMs普遍存在的深层可信性缺陷,为构建更可靠的下一代生成式AI系统提供了关键工具与方向。
链接: https://arxiv.org/abs/2606.00023
作者: Yichuan Mo,Yukun Jiang,Yanbo Shi,Mingjie Li,Michael Backes,Yang Zhang,Yisen Wang
机构: Peking University (北京大学); CISPA Helmholtz Center for Information Security (CISPA亥姆霍兹信息安全中心)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at this https URL.
[NLP-239] mfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation SEMEVAL2026
【速读】: 该论文旨在解决生成式幽默(Humor Generation)中的核心挑战:幽默具有高度主观性,其有效性依赖于受众、语境与文化背景,导致人工标注的“好笑”标准噪声大、一致性低。传统方法依赖绝对评分进行监督,难以捕捉这种相对偏好。为此,论文提出一种“生成大量—优选最佳”(generate-many - select-best)的解决方案,其关键在于构建一个基于人类偏好对比的偏好模型(Preference Model),通过学习成对的人类比较判断而非绝对评分来模拟读者的偏好。该模型利用作者自建的2.5K条基于“幽默竞技场”原型收集的成对判断数据进行训练,并设计可解释的流水线将标注结果转化为偏好模型。实验表明,该方法在多个数据集上均优于基线模型,且具备更强的跨领域泛化能力。最终,该系统在SemEval-2026任务(MWAHAHA)中分别以第一名(英语和中文子任务)和第二名(西班牙语子任务)的成绩胜出,同时公开了候选生成池与排序结果等中间产物,推动后续研究发展。
链接: https://arxiv.org/abs/2606.00022
作者: Alexey Tikhonov,Alexey Ivanov
机构: Inworld.AI(在世界AI); OpenAI(开放人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages. Accepted for SEMEVAL 2026
Abstract:Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because “funny” is audience-dependent and supervision is noisy – preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task-1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a “generate-many - select-best” strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a “reader” by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask. Comments: 5 pages. Accepted for SEMEVAL 2026 Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) MSC classes: 68T50, 68T05 ACMclasses: I.2.7; I.2.6 Cite as: arXiv:2606.00022 [cs.CL] (or arXiv:2606.00022v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.00022 Focus to learn more arXiv-issued DOI via DataCite
[NLP-240] SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding
【速读】: 该论文旨在解决基于检索的推测解码(Retrieval-based Speculative Decoding, RSD)在实际应用中因依赖字面匹配而导致的鲁棒性不足问题,即在面对输入表面形式的微小变化时,检索与验证过程易失效。其核心解决方案是提出SENSE(Semantic Embedding Navigation with Soft-gated Evaluation),通过将检索过程锚定于目标模型的隐藏状态(hidden states),建立更稳健的语义对齐机制,并引入软门控评估(Soft-gated Evaluation)模块,实现对语义等价性的判断而非表层形式的匹配。该方法显著提升了检索-验证流程在多样化输入下的稳定性与有效性。为确保可比性,研究还构建了一个统一框架,将现有方法分解为原子级组件,支持细粒度的对比分析。实验结果表明,SENSE在LLaMA和Qwen系列模型上均取得显著性能提升,最大接受长度达4.09,推理速度最高提升3.26倍,同时保持生成质量不变。
链接: https://arxiv.org/abs/2606.00021
作者: Shaowen Chen,Zhicheng Liao,Hongwei Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation quality. While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms. To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison. Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining up to 4.09 mean acceptance length and 3.26x speedup, while preserving generation quality. Our code will be released upon publication.
[NLP-241] CSRP: Chain-of-Thought Reasoning for Chinese Text Correction via Reinforcement Learning with Efficiency-Aware Rewards ACL2026
【速读】: 该论文旨在解决基于大语言模型(Large Language Model, LLM)的中文语法错误修正(Chinese Grammatical Error Correction, CGEC)系统中存在的两大核心问题:通用预训练模型缺乏针对细微语法差异的领域语言先验知识,以及采用最大似然估计(Maximum Likelihood Estimation, MLE)进行监督微调(Supervised Fine-Tuning, SFT)时无法优化以精确率为导向的评估指标,导致系统存在系统性过度修正(over-correction)问题。为此,论文提出一种三阶段框架CSRP(Continual Pre-training + Chain-of-Thought SFT + Group Relative Policy Optimization),其关键在于通过590万条均衡样本的持续预训练(Continual Pre-training, CPT)内化领域知识,引入基于思维链(Chain-of-Thought)的SFT以增强错误诊断的可解释性,并设计一种新型效率感知奖励函数(Efficiency-Aware Reward)的组相对策略优化(Group Relative Policy Optimization),显式惩罚不必要的修改。实验表明,该方法在NACGEC基准上达到50.99的F₀.₅和57.17的精确率,显著优于此前最优结果,同时有效缓解了MLE训练模型的过度修正偏差;此外,在拼写纠错(CSCD)任务上也达到59.61的F1,超越GPT-4达5.20个百分点。消融实验证明,强化学习对齐阶段相较SFT基线带来8%的相对性能提升,且该增益与大规模持续预训练的贡献正交,验证了对编辑效率进行显式优化对于高质量语法纠错的关键作用。
链接: https://arxiv.org/abs/2606.00020
作者: Wei Tian,Yuhao Zhou,Man Lan
机构: East China Normal University (华东师范大学); Shanghai Institute of Artificial Intelligence for Education (上海人工智能教育研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026, Main conference)
Abstract:Large Language Model (LLM) based Chinese Grammatical Error Correction (CGEC) systems face two critical challenges: general-purpose models lack specialized linguistic priors for subtle grammatical distinctions, and Supervised Fine-Tuning (SFT) with Maximum Likelihood Estimation fails to optimize for precision-focused metrics, leading to systematic over-correction. We propose CSRP, a three-stage framework that progressively builds correction capability through Continual Pre-training (CPT) on 5.9M balanced samples to internalize domain knowledge, Chain-of-Thought SFT with explicit error reasoning for diagnostic transparency, and Group Relative Policy Optimization with a novel Efficiency-Aware Reward that explicitly penalizes unnecessary edits. On the NACGEC benchmark, CSRP achieves state-of-the-art performance with 50.99 F_0.5 and 57.17 precision, substantially outperforming previous best results while effectively mitigating the over-correction bias inherent in MLE-trained models. Our method also advances CSCD spelling correction to 59.61 F1, surpassing GPT-4 by 5.20 points. Comprehensive ablation studies demonstrate that the RL alignment stage contributes a 8% relative gain over the SFT baseline, and that this gain is orthogonal to the contribution of large-scale CPT, validating that explicit optimization for edit efficiency is essential for high-quality grammatical error correction. Our code is available at this https URL.
[NLP-242] AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection
【速读】: 该论文旨在解决当前生成式AI(Generative AI)文本检测面临的挑战,即随着语言模型日益接近人类语言流畅度,依赖表面统计特征或似然性信号的检测方法逐渐失效。其核心解决方案是提出一种基于归因(attribution-driven)的人类与AI文本作者身份识别方法——\textscAEyeDE,该方法利用模型注意力机制(attention mechanism)作为判别性信号。关键在于通过具有白盒访问权限的代理Transformer模型,提取人类与AI生成文本的注意力归因矩阵,并训练一个轻量级卷积神经网络(Convolutional Neural Network, CNN)从这些归因图中学习区分性表征。实验表明,在编码器-解码器翻译场景下,该方法显著优于仅依赖文本内容的基线模型;在解码器-only设置中,其在特定生成器检测任务中表现优异,且在标准基准测试中保持竞争力,同时对跨数据集迁移和变体拼写扰动展现出强鲁棒性。进一步分析发现,注意力图中存在重复出现的局部结构模式,其相对频率在不同数据集和代理模型间均表现出人类与AI生成文本的一致性差异,表明注意力归因图可提供一种互补且可解释的检测信号。
链接: https://arxiv.org/abs/2606.00016
作者: Aria Nourbakhsh,Adelaide Danilov,Christoph Schommer,Salima Lamsiyah
机构: University of Luxembourg (卢森堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages, 2 figures
Abstract:Detecting AI-generated text is becoming increasingly challenging as modern language models approach human-level fluency and can evade detectors that rely on surface statistics or likelihood-based signals. We propose \textscAEyeDE, an attribution-driven approach to human-AI authorship detection that leverages model attention as a discriminative signal. Specifically, we extract attention-based attribution matrices for both human- and AI-generated text using a \emphproxy Transformer model with white-box access and train a lightweight Convolutional Neural Network to learn representations from these attribution maps. Across encoder-decoder translation settings, our method consistently outperforms a text-only baseline. In decoder-only settings, it performs strongly in generator-specific detection, remains competitive on standard benchmarks, and shows robustness under cross-dataset transfer and alternative-spelling perturbations. We further show that attention maps exhibit recurring local structures whose relative frequencies differ consistently between human- and AI-generated text across datasets and proxy models. These findings suggest that attention-based attribution maps provide a complementary and interpretable signal for AI-generated text detection. We will make the code publicly available to support future research.
[NLP-243] oward Robust In-Context Learning: Leverag ing Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在分布外(Out-of-Distribution, OOD)场景下性能随分布偏移加剧而下降的问题。其核心挑战在于:当目标域不可访问时,难以评估未知分布,进而影响从源域中检索到的示范样本(demonstrations)的质量。为此,论文提出DOPA框架,其关键创新在于引入一个分布外代理(OOD proxy)以近似不可访问的目标域,并指导示范样本的检索过程;同时,DOPA进一步设计基于马哈拉诺比斯距离(Mahalanobis distance)的全局多样性约束,确保所选示范在语义空间中具有充分的多样性,从而提升模型在严重分布偏移下的鲁棒性。实验结果表明,该方法在多个大语言模型和任务上均能有效增强模型的OOD泛化能力。
链接: https://arxiv.org/abs/2606.00014
作者: Hao Xu,Rite Bo,Fausto Giunchiglia,Yingji Li,Rui Song
机构: Jilin University (吉林大学); University of Trento (特伦托大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by ACL 2026 main
Abstract:Although studies have demonstrated that Large Language Models (LLMs) can perform well on Out-of-Distribution (OOD) tasks, their advantage tends to diminish as the distribution shift becomes more severe. Consequently, researchers aim to retrieve distributionally similar and informative demonstrations from the available source domain to boost the inference capabilities of LLMs. However, in practical scenarios where the target domain is inaccessible, evaluating the unknown distribution is challenging, which indirectly impacts the quality of the selected demonstrations. To address this problem, we propose \textbfDOPA, a demonstration search framework that incorporates an OOD proxy to approximate the inaccessible target domain and guide the retrieval process. Building on proxy-based evaluation, DOPA further introduces a Mahalanobis distance-based global diversity constraint to ensure sufficient diversity among the retrieved demonstrations. Experimental results on multiple LLMs and tasks demonstrate that DOPA effectively enhances robustness in OOD settings\footnotethis https URL_code.
[NLP-244] DraDDP: A Multimodal Multi-Party Dialogue Discourse Parsing Dataset
【速读】: 该论文旨在解决多参与者对话中话语结构与语义关系识别的难题,尤其针对现有研究在模态(仅限文本)和对话参与方数量(仅限双人对话)上的局限性。其核心挑战在于如何有效融合多模态信息(如语音、视觉等)以准确解析复杂多参与者场景下的对话依赖结构与关系类型。解决方案的关键在于构建首个公开可用的英文多模态多参与者对话话语解析数据集DraDDP,该数据集基于美国电视剧,包含495段对话片段、6,374条语句及9.1小时的同步视频内容,全面覆盖丰富的多主体互动情境,并在此基础上建立系统性评估基准,通过深入分析不同模态对任务性能的影响,验证了多模态信息在捕捉对话结构与关系类型中的关键作用。
链接: https://arxiv.org/abs/2606.00012
作者: Shannan Liu,Peifeng Li,Yaxin Fan,Qiaoming Zhu
机构: Soochow University (苏州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-party dialogue discourse parsing aims to identify dependency structures and relation types between utterances in conversations. Previous studies are mostly limited to textual modality or two-party dialogue, failing to meet the multimodal and multi-party settings. In this paper, we construct the first publicly available English multimodal dataset DraDDP for multi-party dialogue discourse parsing, based on American TV dramas. DraDDP contains 495 dialogue segments with 6,374 utterances and 9.1 hours of parallel video content, covering rich multi-party interaction scenarios. Moreover, we establish comprehensive benchmarks by evaluating this task on DraDDP and conducting in-depth analysis on the impact of different modalities. Experimental results demonstrate the value of multimodal information in capturing dialogue structures and relation types. We will publicly release the dataset, annotation guidelines, and code to promote future research in multimodal dialogue understanding.
[NLP-245] BenHalluEval: A Multi-Task Hallucination Evaluation Framework for Large Language Models on Bengali
【速读】: 该论文旨在解决低资源语言 Bengali 在大语言模型(LLM)中缺乏系统性幻觉评估的问题。尽管 Bengali 是全球第六大使用语言,但此前尚无针对其幻觉现象的全面评测框架。为此,作者提出了 BenHalluEval,一个面向 Bengali 的细粒度幻觉评估框架,涵盖生成式问答(Generative Question Answering, GQA)、孟加拉-英语混用问答、摘要生成和推理四个任务。其关键解决方案在于设计了一套双轨评估协议(dual-track protocol):Track A 量化在真实答案样本上的假阳性率(false-positive rate),Track B 评估对人工构造的 12,000 个幻觉候选样本的检测率。为克服单一指标偏差并联合惩罚两类错误,提出 BenHalluScore 作为双轨校准评分指标,有效揭示了不同模型与任务间显著的幻觉校准差异(7.72%–55.42%)。研究还发现,链式思维提示(Chain-of-thought prompting)虽能改变响应分布,但并未一致提升幻觉识别能力。该工作首次建立了 Bengali 专属的幻觉基准,强调了在低资源语言场景下,仅依赖单轨评估或提示工程的局限性,推动了更严谨的多维度评估范式发展。
链接: https://arxiv.org/abs/2605.31483
作者: Shefayat E Shams Adib,Ahmed Alfey Sani,Ekramul Alam Esham,Ajwad Abrar,Ishmam Tashdeed,Md Taukir Azam Chowdhury
机构: Islamic University of Technology (伊斯兰大学技术学院); University of California (加州大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Preprint. Under review
Abstract:Despite Bengali being the sixth most spoken language in the world, no prior work has systematically evaluated hallucination in large language models (LLMs) for Bengali. We introduce BenHalluEval, a fine-grained hallucination evaluation framework for Bengali covering four tasks: Generative Question Answering (GQA), Bangla-English Code-Mixed QA, Summarization, and Reasoning. We construct 12,000 hallucinated candidates using GPT-5.4 across twelve task-specific hallucination types, drawn from three existing Bengali datasets, and evaluate seven LLMs spanning reasoning-oriented, multilingual, and Bengali-centric categories under a dual-track protocol that independently measures false-positive rate on ground-truth instances (Track A) and hallucination detection rate on hallucinated candidates (Track B). To jointly penalise both failure modes and prevent inflated scores from uniform response bias, we propose BenHalluScore, a dual-track calibration metric that ranges from 7.72% to 55.42% across models and tasks, revealing substantial variation in hallucination calibration. Chain-of-thought prompting, applied as a mitigation strategy, shifts response distributions without consistently improving hallucination discrimination. BenHalluEval establishes the first dedicated hallucination benchmark for Bengali and highlights the inadequacy of single-track and prompting-only evaluation approaches for low-resource language settings. The dataset and code are available at this https URL.
[NLP-246] Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在系统性推理中缺乏可追溯证据支撑的“认知鸿沟”问题,即模型虽能生成流畅文本却常出现自信但无根据的幻觉性断言。其核心解决方案是引入普拉曼纳(Pramana)框架,通过在2500年历史的印度哲学逻辑体系——那瓦-尼雅亚(Navya-Nyaya)的基础上对LLM进行微调,赋予模型显式的认识论方法论。该框架强制执行六阶段结构化推理流程:怀疑分析(SAMSHAYA)、证据源识别(PRAMANA)、五成分三段论(PANCHA AVAYAVA)、反事实验证(TARKA)、谬误检测(HETVABHASA)与确证判断(NIRNAYA),从而提供标准链式思维提示之外的认知支架。实验表明,在仅40%严格格式遵循的情况下,模型在保留测试集上仍实现100%语义正确率,说明其已内化推理内容;消融实验进一步揭示格式提示与温度设置对性能的关键影响,且最优配置随训练阶段而异。研究团队已将所有模型、数据集与训练基础设施开源至Hugging Face,以推动面向人工智能推理的可解释性与可信性研究。
链接: https://arxiv.org/abs/2604.04937
作者: Sharath Sathish
机构: University of York (约克大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 52 pages + appendices, comprehensive treatment of Navya-Nyaya computational formalization
Abstract:Large language models produce fluent text but struggle with systematic reasoning, often hallucinating confident but unfounded claims. When Apple researchers added irrelevant context to mathematical problems, LLM performance degraded by 65% Apple Machine Learning Research, exposing brittle pattern-matching beneath apparent reasoning. This epistemic gap, the inability to ground claims in traceable evidence, limits AI reliability in domains requiring justification. We introduce Pramana, a novel approach that teaches LLMs explicit epistemological methodology by fine-tuning on Navya-Nyaya logic, a 2,500-year-old Indian reasoning framework. Unlike generic chain-of-thought prompting, Navya-Nyaya enforces structured 6-phase reasoning: SAMSHAYA (doubt analysis), PRAMANA (evidence source identification), PANCHA AVAYAVA (5-member syllogism with universal rules), TARKA (counterfactual verification), HETVABHASA (fallacy detection), and NIRNAYA (ascertainment distinguishing knowledge from hypothesis). This integration of logic and epistemology provides cognitive scaffolding absent from standard reasoning approaches. We fine-tune Llama 3.2-3B and DeepSeek-R1-Distill-Llama-8B on 55 Nyaya-structured logical problems (constraint satisfaction, Boolean SAT, multi-step deduction). Stage 1 achieves 100% semantic correctness on held-out evaluation despite only 40% strict format adherence revealing that models internalize reasoning content even when structural enforcement is imperfect. Ablation studies show format prompting and temperature critically affect performance, with optimal configurations differing by stage. We release all models, datasets, and training infrastructure on Hugging Face to enable further research on epistemic frameworks for AI reasoning.
[NLP-247] Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection
【速读】: 该论文旨在解决高维数据空间中目标观测样本的分布外(out-of-distribution, OOD)检测问题,尤其关注嵌入在低维子空间中的表示。其核心挑战在于:尽管生成式模型(如连续归一化流,CNF)在建模数据分布方面表现优异,但普遍存在“似然悖论”——即对分布外样本错误地赋予高似然值,这源于深度生成模型(DGMs)固有的归纳偏置:过度关注低层结构细节而忽视高层语义一致性。为此,论文提出一种拉格朗日子流(Lagrangian sub-flow, LSF)框架,通过将表示分解为与任务相关的子空间成分和上下文成分,实现对相关特征密度的有效估计。关键解决方案在于利用子流轨迹上的速度场构建一系列几何诊断信号,这些信号能够捕捉数据流形上的非平凡动态特性,从而揭示潜在的分布外异常。基于这些信号,设计出适用于零样本语音音素级发音错误检测的新型度量指标,并在真实世界发音错误检测基准上验证了其显著优于传统基于似然的方法。
链接: https://arxiv.org/abs/2606.00684
作者: Xinwei Cao,Mengxuan Lu,Torbjørn Svendsen,Giampiero Salvi
机构: Norwegian University of Science and Technology (挪威科技大学); Tsinghua University (清华大学)
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 16 pages, 5 figures
Abstract:We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density for the relevant components in the representation and using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the “likelihood paradox”, where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs that prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.
信息检索
[IR-0] Dynamic Spectral Denoising with Global-Context Attention for Multi-Behavior Recommendation
链接: https://arxiv.org/abs/2606.02417
作者: Miaomiao Cai,Yunshan Ma,Fangqi Zhu,Junfeng Fang,Zhijie Zhang,Zhiyong Cheng,Xiang Wang,See-Kiong Ng
类目: Information Retrieval (cs.IR)
备注:
Abstract:Multi-behavior recommendation improves target-behavior prediction by exploiting heterogeneous auxiliary feedback (e.g., view, collect, and cart), yet its robustness is undermined by behavior-dependent noise and inconsistency. We argue that the key bottleneck is a representation-level failure caused by two coupled heterogeneities. First, intra-behavior representation entanglement arises when multi-hop propagation blends incidental signals with true preferences in the embedding space, making coarse spatial denoising unable to suppress noise without sacrificing informative niche signals. Second, inter-behavior reliability heterogeneity complicates cross-behavior fusion because the predictive value of auxiliary behaviors varies across users and contexts. Without reliability calibration, frequent yet unreliable signals may dominate aggregation and cause target-intent drift. To address this bottleneck, we propose Dynamic Spectral Denoising with Global-Context Attention for Multi-Behavior Recommendation (SpectraMB), a target-oriented model that performs representation purification before reliability-aware fusion. SpectraMB introduces Dynamic Feature-Level Spectral Filtering, which re-parameterizes embeddings along the feature dimension into a feature-frequency space and learns view-adaptive spectral modulation under target supervision, enabling component-wise purification without hand-crafted frequency assumptions. It further proposes Global-Context Attention Fusion, which uses a purified global representation as a context anchor to assess view compatibility and perform reliability-aware aggregation, while a residual global backbone preserves collaborative structure. Extensive experiments on three real-world datasets show that SpectraMB achieves the best results in most evaluation settings and exhibits improved robustness under noisy interactions. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2606.02417 [cs.IR] (or arXiv:2606.02417v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2606.02417 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26), August 09–13, 2026, Jeju Island, Republic of Korea Related DOI: https://doi.org/10.1145/3770855.3818191 Focus to learn more DOI(s) linking to related resources
[IR-1] Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
链接: https://arxiv.org/abs/2606.02373
作者: Pengcheng Jiang,Zhiyi Shi,Kelly Hong,Xueqiang Xu,Jiashuo Sun,Jimeng Sun,Hammad Bashir,Jiawei Han
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints remain open, and which claims have actually been checked. We argue that this formulation puts too much routine state management inside the policy: reinforcement learning is forced to optimize both semantic search decisions and recoverable bookkeeping that the environment can maintain more reliably. We introduce Harness-1, a 20B search agent (retrieval subagent) trained with reinforcement learning inside a stateful search harness. The harness maintains environment-side working memory, including a candidate pool, an importance-tagged curated set, compact evidence links, verification records, compressed and deduplicated observations, and budget-aware context rendering. The policy retains the semantic decisions: what to search, which documents to keep or discard, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, Harness-1 achieves 0.730 average curated recall, outperforming the next strongest open search subagent by +11.4 points and remaining competitive with much larger frontier-model searchers. Its gains are especially strong on held-out transfer benchmarks, suggesting that reinforcement learning over explicit search state can produce retrieval behaviors that generalize beyond the training domains. Our code is available at this https URL.
[IR-2] Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis
链接: https://arxiv.org/abs/2606.02162
作者: Catyana Heyne,Jürgen Frikel,Filippo Riccio
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:Document type classification in visually rich documents remains challenging, as relevant information is distributed across textual, visual, and layout modalities. To capture this complexity, current approaches rely on diverse multimodal modeling strategies, resulting in heterogeneous architectures that complicate systematic comparison. This variability is also reflected in existing comparative studies, which often rely on heterogeneous evaluation setups, further complicating systematic comparison and making it difficult to assess progress. To address these limitations, this work provides a structured analysis of multimodal design strategies across transformer- and LLM-based architectures, combined with a controlled empirical comparison within a unified experimental framework. Specifically, four representative models (LayoutLMv3, Donut, Qwen3-VL-32B-Instruct, and Qwen3-32B) are evaluated on the RVL-CDIP benchmark to systematically analyze the contributions of text, image, and layout information for document type classification, with a particular focus on contrasting OCR-dependent and OCR-free approaches. The results show that specialized multimodal Transformers outperform LLM-based approaches on visually rich and layout-intensive documents. Image information contributes most strongly to reliable classification, while OCR-derived text provides useful but secondary support. These findings highlight that multimodal processing remains essential for documents with pronounced layout structure. Overall, the study provides a systematic basis for comparing multimodal architectures and offers practical guidance for selecting effective feature combinations and model designs for document type classification.
[IR-3] Rank-Constrained Deep Matrix Completion for Group Recommendation
链接: https://arxiv.org/abs/2606.01948
作者: Mubaraka Sani Ibrahim,Lehel Csató,Isah Charles Saidu
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:The growing popularity of group activities has increased the need for methods that provide recommendations to groups of users given their individual preferences. Many existing group recommender systems rely on aggregating individual user preferences, but they often struggle with high-dimensional and highly sparse rating data commonly found in real-world scenarios. We propose Group Rank-Constrained Deep Matrix Completion (Group RC-DMC), a novel framework that extends RC-DMC by integrating group-level representation learning via a Set-Transformer aggregator, jointly leveraging low-rank structure and attention-based nonlinear modeling. Unlike most existing group recommender systems, Group RC-DMC unifies explicit low-rank regularization, linear encoder-decoder architectures, and attention-based nonlinear group modeling within a single framework, yielding accurate predictions at both the individual and group levels. Group RC-DMC addresses data sparsity through low-rank matrix completion, computing per-user latent representations from observed ratings only, and enforcing a rank constraint on the latent space using a nuclear-norm proximal step based on periodic singular value thresholding. The decoder is parametrized as a low-rank factorization, enabling efficient inference. Experimental results on the MovieLens and Goodbooks datasets demonstrate that Group RC-DMC achieves superior reconstruction accuracy, measured by lower group RMSE, while remaining computationally efficient and competitive in group-level performance in terms of precision, recall, and F1 score compared with weighted-before-factorization (WBF) and after-factorization (AF) baselines. The results highlight the model’s ability to recover the underlying low-rank structure of user-item interactions and provide robust group recommendations across small, medium, and large user groups.
[IR-4] Decoupled Residual Quantization for Robust Semantic IDs in Recommendation
链接: https://arxiv.org/abs/2606.01844
作者: Xuesi Wang,Junjie Wang,Ziliang Wang,Weijie Bian,Guanxing Zhang
类目: Information Retrieval (cs.IR)
备注:
Abstract:Semantic IDs represent items as shared discrete token sequences and have become a practical tool for recommendation and retrieval. Yet it remains difficult to tell why a tokenizer fails: poor quality may come from codebook underutilization, unstable decision boundaries, or geometric distortion of the embedding space. This paper develops a quantitative framework for diagnosing these failures through expected codeword overlap and effective codebook capacity. The former measures expected codeword confusion under retrieval-time perturbation, while the latter converts that confusion into an effective number of usable, well-separated codes. The framework links semantic boundary confusion to both code usage imbalance and Euclidean geometric constraints. As a proof of concept, we present Decoupled Residual Quantization (DRQ), which separates continuous geometry reconstruction from discrete distribution matching. Experiments on a large-scale industrial dataset show that Semantic ID quality is multi-objective: symbolic robustness, reconstruction fidelity, and behavior-aware soft matching each stress different aspects of a tokenizer. These downstream observations are based on one proprietary industrial dataset, so they should be read as a case study rather than a universal benchmark claim.
[IR-5] Breaking the Information Silo: Semantic Personas for Cross-Domain Recommendation
链接: https://arxiv.org/abs/2606.01783
作者: Jonathan Mayo,Moshe Unger,Konstantin Bauman
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Digital platforms increasingly operate as isolated information silos, limiting their ability to construct comprehensive user representations across domains. Cross-domain recommender systems seek to overcome this limitation by transferring knowledge from a source domain to a target domain, yet most existing approaches depend on shared users, shared items, or structurally similar interaction graphs. These assumptions are often unrealistic across independent platforms. We propose SPHERE (Semantic Personas for Heterogeneous cross-domain Recommendation), a design artifact that enables recommendation knowledge transfer across strictly disjoint domains with no shared users or items. Rather than aligning domains through identity or graph structure, SPHERE uses large language models to induce a shared behavioral vocabulary, generate structured semantic personas for users, and retrieve behaviorally similar source-domain communities that form a Community Source Persona. This semantic signal is integrated with collaborative signals through a dual-tower architecture and dynamic fusion gate, allowing SPHERE to augment standard recommender backbones. Empirical evaluation across Amazon Books, Goodreads, and Steam demonstrates consistent improvements over NCF, SVD++, and LightGCN baselines under full-ranking evaluation. The results show that cross-domain transfer effectiveness is not determined solely by semantic proximity between domains; rather, it depends critically on the structural density and native predictive strength of the target domain. The study contributes to information systems research by reframing cross-domain personalization as behavior-based semantic alignment, offering a practical mechanism for overcoming information silos while preserving interpretability and modularity.
[IR-6] Whole-Pool Setwise Reranking with Long-Context Language Models
链接: https://arxiv.org/abs/2606.01782
作者: Hang Li,Chuting Yu,Teerapong Leelanupab,Bevan Koopman,Guido Zuccon
类目: Information Retrieval (cs.IR)
备注: 4 pages main content, 10 page Appendix
Abstract:Previous LLM-based passage re-rankers are often expensive and slow because the input context constraints require the LLM to make many dependent model calls. We study how recent long-context LLMs change this problem: when the full set of retrieved candidate passages can be shown to the model at once, ranking no longer has to be reconstructed from many overlapping local comparisons. We propose Whole-Pool Setwise re-ranking, where each call considers all currently unranked candidate passages, and introduce DualEnd, which identifies both the most and least relevant passages in one call. By filling the ranking from both ends, DualEnd ranks 100 candidates with 50 serial LLM calls, compared with 99 calls for comparable one-passage-at-a-time whole-pool methods. Experiments with nine open-weight LLMs on two passage re-ranking benchmarks, measuring effectiveness, call count, token use, runtime, and output reliability shows that long context is not merely more prompt space, but an opportunity to make LLM re-rankers both effective and efficient.
[IR-7] me-Aware Diffusion based on Preference Disentanglement for Generative Recommendation
链接: https://arxiv.org/abs/2606.01670
作者: Bangguo Zhu,Peng Huo,Yuanbo Zhao,Zhicheng Du,Jun Yin,Senzhang Wang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, Generative Recommenders (GRs) have emerged as a transformative recommendation paradigm by replacing traditional item IDs with semantic indices (SIDs). Owing to the exceptional generative capabilities of diffusion models, a few pioneering works explore developing GRs with diffusion architectures as the backbone. However, a fatal limitation of existing diffusion-based GRs is that the diffusion process applies uniformly to all items within the historical interactions. In contrast, the user preference is shaped by multifaceted time-evolving factors and thus exhibits a non-stationary distribution in the temporal aspect. To bridge this gap, this study proposes a novel GR framework, named TDPM, by designing the time-aware diffusion on SID tokens. Specifically, TDPM explicitly integrates the impact of time-evolving user preferences into the diffusion process. In detail, the user preference is disentangled into (i) the period preference, which remains consistent over a long time-span, and (ii) the point preference, which is triggered by recent focal events. Extensive experiments on three public real-world datasets demonstrate the significant superiority of TDPM over the state-of-the-art baselines. TDPM achieves average improvements of up to 29.21% and 25.45% in terms of HR@20 and NDCG@20, respectively. The ablation study further underscores the necessity of time-aware token diffusion in diffusion-based GRs.
[IR-8] Self-Conditioned Positional HNSW for Overlap-Aware Retrieval in Chunked-Document RAG Systems: Method and Industrial Evidence-Quality Audit
链接: https://arxiv.org/abs/2606.01542
作者: Nataraj Agaram Sundar,Tejas Morabia
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
备注: 11 pages, 5 figures, 4 tables
Abstract:Chunked-document retrieval is a common component of retrieval-augmented generation (RAG) systems. Documents are split into overlapping chunks, embedded, and indexed with approximate nearest-neighbor search such as hierarchical navigable small world graphs (HNSW). Overlap improves boundary coverage but induces a practical failure mode: top-k retrieval often returns near-adjacent chunks that repeat evidence and waste prompt budget. We propose Self-Conditioned Positional HNSW (SCP-HNSW), a lightweight modification that appends a low-dimensional positional code to chunk embeddings and uses a two-pass query procedure to estimate and apply a query-specific document-position prior. SCP-HNSW leaves HNSW graph construction and traversal unchanged while adding an auditable minimum-index-gap selector for final context construction. We also integrate industrial review artifacts for generated evidence quality: a 770-review text-evidence audit with 318 fully labeled reviews and a 70-case OCR audit with 350 ratings. The text audit shows that 574 of 770 projected reviews are rated 3/5, only 39 fall in the 1-2 range, and narrative reviewer detail appears much more often than structured issue flags. The OCR audit shows slice-level pass rates from 95% for clean chat screenshots to 45% for handwritten/blurry captures, with moderate to strong agreement. These results motivate overlap-aware, audit-friendly RAG retrieval and identify the remaining controlled retrieval ablations needed for causal performance claims.
[IR-9] Semantic Retrieval for Product Search in E-Commerce
链接: https://arxiv.org/abs/2606.01504
作者: Nikhil Kothari,Saksham Samdani,Ritam Mallick,Praveen Gupta,Ankit Vijay,Surender Kumar
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Semantic retrieval in e-commerce must handle short, noisy, and colloquial queries over large product catalogs with fine-grained attribute distinctions. We present a Siamese LLM dual-encoder trained through a two-stage pipeline: contrastive learning with a false-negative margin mask to prevent penalization of near-duplicate products, followed by Relative Odds Alignment for Retrieval (ROAR), a preference optimization objective that extends Bradley-Terry to variable-sized graded relevance groups via consecutive odds-ratio margins. The training corpus mirrors this progression - substitute query-product pairs provide coarse semantic supervision in Stage 1 and graded relevance annotations drive fine-grained ranking in Stage 2. The resulting system accurately retrieves exact matches while correctly ordering substitutes and complementary products, with gains confirmed across query-frequency strata and business verticals, and statistical significance validated through live A/B deployment at scale.
[IR-10] Dont Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution
链接: https://arxiv.org/abs/2606.01435
作者: Vikas Reddy,Sumanth Challaram
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:LLM-based memory systems increasingly maintain facts that evolve over time, where a recurring failure is conflict resolution: when a fact has multiple contradictory values, which should the agent return? MemoryAgentBench (MAB; Hu et al., 2026) makes this explicit in its FactConsolidation task: facts are numbered, the counterfactual has the higher serial, and agents are told newer facts have larger serials. Yet every published system underperforms: HippoRAG-v2 reaches 54% on single-hop (FC-SH), BM25 48%, Mem0 18%, and the temporal KG Zep/Graphiti just 7%. Multi-hop is near-unsolved (at most 7% across 22 systems). We argue the bottleneck is the assembly step: baselines leave conflict resolution to LLM-mediated retrieval or generation rather than version-aware aggregation. A matched-setup comparison (same backbone, retrieval, chunking, TOP_K) shows that replacing the LLM-judgment answer pipeline with candidate-extraction plus Python max(serial) yields +10.8 points on FC-SH (gpt-4o-mini), widening from +8 at 6K to +21 at 262K. This is a whole-pipeline effect (resolver, prompt, format, and temperature vary jointly); isolating the resolver is future work. The recipe reaches 78.0% on FC-SH (gpt-4o-mini), 94.8% (gpt-4o), and 30.2% on FC-MH (gpt-4o-mini, rising to 51.5% with gpt-4o) via a per-hop deterministic extension of Self-Ask. At matched-262K, it beats HippoRAG-v2 by +28 points and the best published FC-MH result by +20. The implication is corrective for the subfield: the bottleneck on conflict resolution is assembly (post-retrieval aggregation), not storage. A LongMemEval knowledge-update check shows the mechanism ports from max(serial) to max(timestamp) but only ties LLM judgment (57.8% vs 64.4%, n=45): deterministic aggregation is the right primitive for current-value conflicts and must be composed with question-type-aware handling for broader memory QA. Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR) Cite as: arXiv:2606.01435 [cs.AI] (or arXiv:2606.01435v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.01435 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-11] Differentially Private Datastore Generation for Retrieval-Augmented Inference ICPR-2026
链接: https://arxiv.org/abs/2606.01413
作者: Abdelrahman Abouelenein,Marwan Torki
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: Accepted at the 28th International Conference on Pattern Recognition (ICPR-2026)
Abstract:It is crucial for modern on-device AI systems that rely on retrieval-augmented inference to release and share datastores without compromising individual privacy. This can be achieved using Differential Privacy (DP), which provides a formal guarantee that ensures individual contributions remain indistinguishable, even under adversarial analysis. In this paper, we introduce a hashing-based probability generation framework designed to enable the creation and release of differentially private datastores. Our approach employs locality-sensitive hashing (LSH) to efficiently partition high-dimensional data into buckets. We then add calibrated DP noise to the accumulated vote for each bucket, generating a probability distribution across classes. Our method is broadly applicable to any pipeline requiring secure key,value datastore creation and release. We conducted experiments on seven datasets with varying sample sizes and class counts, ranging from 2 to 14. At epsilon=5, our released DP datastore achieves strong privacy protection with only an average 2.6% drop in accuracy. Finally, we benchmark DP datastore resilience to membership inference attacks, reducing attack accuracy to 53.60%.
[IR-12] Quantizing Intent: Cross-Domain Semantic IDs from Organic Activity for Industrial Ranking
链接: https://arxiv.org/abs/2606.01396
作者: Julie Choi,Haoran Ye,Zhiwei Ding,Bo Long,Benjamin Zelditch,Arpita Vats
类目: Information Retrieval (cs.IR)
备注:
Abstract:Ads click-through rate (CTR) prediction is constrained by sparse user supervision: most users engage with ads infrequently while generating dense behavioral evidence in organic surfaces such as feed. Transferring these cross-domain signals into ads ranking is difficult due to domain mismatch, serving cost, and production complexity. We introduce cross-domain user Semantic IDs (SIDs) derived from organic feed activity and show that behavioral activity richness governs cross-domain transfer quality: SIDs from user profile text yield +0.036% AUC, SIDs from an activity-tuned LLaMA-based user embedding model yield +0.107%, and SIDs from direct feed activity behavioral embeddings yield +0.213%. We further propose RQ-FSQ, a residual finite scalar quantization method that discretizes pre-trained embeddings while matching dense-embedding AUC at substantially smaller storage. Across two heterogeneous sources, RQ-FSQ matches or slightly exceeds dense source embeddings, achieving +0.351% AUC for Feed Activity at about 30x smaller storage and +0.265% AUC for Activity-Tuned LLaMA at about 280x smaller storage. We also introduce a Hierarchical Discrete Embedding module that encodes multi-level SIDs through prefix n-gram sparse embedding tables trained end-to-end under the CTR objective. In a large-scale industrial ads ranking system, cold-start segment analysis shows gains up to +1.522% for users with near-zero ad interaction history, validating cross-domain behavioral transfer as an effective bridge for sparse-history ranking.
[IR-13] FAiT: Frequency-Aware Inverted Transformer for Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2606.01306
作者: Peng He,Yao Liu,Yanglei Gan,Run Lin,Yuxiang Cai,Qiao Liu
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
备注:
Abstract:While Transformer-based architectures have established themselves as a dominant paradigm in Multivariate Time Series Forecasting (MTSF), their core self-attention mechanism inherently functions as a low-pass filter, systematically smoothing out high-frequency signals vital for sharp local changes. Recent advancements have increasingly incorporated frequency-domain operations to address this bias, however, most existing designs rely on fixed spectral bases and apply sequence-wise (uniform) modulation, implicitly assuming a time-invariant frequency response. This overlooks a key property of real-world series that their spectral characteristics often evolve over time, making uniform modulation insufficient for capturing fine-grained temporal dynamics. To tackle these limitations, we propose FAiT, a Frequency-Aware inverted Transformer. Specifically, FAiT rectifies the spectral bias internally through Inverted Attention, which interprets the attention map as a learnable low-pass operator and constructs a dedicated complementary high-pass branch by inverting the attention matrix to recover attenuated transient signals. Furthermore, FAiT introduces Dynamic Temporal-Frequency Modulation (DTFM), which synthesizes instance-conditioned weights to adaptively re-calibrate the energy of spectral sub-bands, enabling fine-grained control over evolving multi-scale patterns. Extensive experiments on widely used benchmarks demonstrate that FAiT consistently outperforms state-of-the-art Transformer-based and frequency-enhanced baselines, while maintaining computational efficiency.
[IR-14] DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation
链接: https://arxiv.org/abs/2606.01212
作者: Yuyang Gong,Miaokun Chen,Jiawei Liu,Zhuo Chen,Guoxiu He,Wei Lu,XiaoFeng Wang,Xiaozhong Liu
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
备注:
Abstract:Retrieval-Augmented Generation (RAG) systems are widely deployed and increasingly influential, but their reliance on external corpora exposes new security risks from poisoned retrieval content. Existing RAG attacks are largely focusing on individual queries or narrow topic-local query sets, which limits their practical reach and offers limited camouflage in real-world settings. In this paper, we introduce discourse-level opinion manipulation, a new threat model in which coordinated influence across a semantic query network induces opinion shifts over a holistic, multi-topic query space. We formalize this threat in a black-box setting and propose DiscourseFlip, an agentic, graph-guided attack that dynamically allocates a limited poisoning budget to maximize discourse-level opinion deviation. Extensive experiments demonstrate that DiscourseFlip consistently induces targeted opinion shifts across the contextualized query network and significantly outperforms existing baselines in terms of coverage and effectiveness. User studies further confirm that DiscourseFlip is effective while remaining well camouflaged from user detection. Moreover, systematic analyses show that existing mitigation strategies are ineffective against discourse-level manipulation, underscoring the urgent need for more robust and adaptive defenses to address discourse-level vulnerabilities.
[IR-15] st-Time Training for Zero-Resource Dense Retrieval Reranking ACL2026
链接: https://arxiv.org/abs/2606.01070
作者: Shiyan Liu,Yichen Li
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at KnowFM @ ACL 2026
Abstract:Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most of BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix W via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.
[IR-16] SkillPager: Query-Adaptive Intra-Skill Navigation via Semantic Node Retrieval
链接: https://arxiv.org/abs/2606.00822
作者: Zicai Cui,Zihan Guo,Weiwen Liu,Weinan Zhang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 20 pages, 6 figures
Abstract:Skill-based LLM agents increasingly rely on long procedural documents, but full-document prompting wastes tokens and dilutes information critical to execution. We study this setting as intra-skill retrieval, where the goal is to select a minimal, execution-sufficient context from a known skill document given a query. We present SkillPager, a two-stage framework that parses each Markdown skill into typed semantic nodes offline and leverages Maximal Marginal Relevance (MMR) to perform global, query-conditioned node selection online. On a benchmark of 395 skills and 1,975 queries, SkillPager achieves 78.89% LLM-judged context sufficiency, compared to 82.23% for the exhaustive full-document baseline, while reducing prompt tokens by 47.04%. A granularity ablation shows that applying the same retrieval algorithm to raw fixed-length chunks reaches a comparable 81.77% sufficiency but increases token cost by 28.81%, demonstrating that efficiency gains are driven by typed semantic granularity rather than the retrieval algorithm alone. Among graph-based baselines, SkillPager outperforms the strongest baseline by a margin of 12.16%. Further ablations show that supporting content is most effective when retained in the candidate pool and selected adaptively rather than removed by static heuristics. These results identify typed intra-document retrieval as a distinct access problem for skill-based agents.
[IR-17] SpikeHash: Learning Binary Codes with Spiking Neural Networks for Cross-Modal Hashing Retrieval
链接: https://arxiv.org/abs/2606.00740
作者: Yukuan Zhang,Jiarui Zhao,Shangqing Nie,Shengsheng Wang
类目: Information Retrieval (cs.IR); Multimedia (cs.MM)
备注:
Abstract:Cross-modal hashing retrieval encodes heterogeneous data into compact binary codes for efficient Hamming-space search. Existing methods usually learn cross-modal semantics in continuous feature spaces and generate binary codes through a final sign operation, which weakly couples training optimization with discrete hash retrieval. We propose SpikeHash, a unified spiking framework that formulates cross-modal hashing as spike-state evolution, directional spike interaction, and competitive spike readout. Specifically, SpikeHash converts image and text features into multi-timestep spike sequences. In a shared Hamming space, the two spike sequences jointly drive the temporal evolution of a shared hash state. Cross-modal interaction is further performed through directional spike modulation, enabling each modality to influence the firing dynamics of the other. Crucially, SpikeHash replaces the conventional continuous hash head with a positive-negative spiking hash readout, where each hash bit is produced by temporal competition between paired spike channels. Experimental results show that SpikeHash achieves competitive retrieval accuracy on three benchmark datasets while reducing the parameter size, operation count, and estimated energy of the hash learning stage, suggesting a compact spiking alternative to conventional continuous hash mapping. The project page is available at this https URL.
[IR-18] Critic-R: Improving Agent ic Search using Instruction-tuned Retrievers with Natural Language Introspective Feedback
链接: https://arxiv.org/abs/2606.00590
作者: Md Zarif Ul Alam,Alireza Salemi,Hamed Zamani
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic search systems iteratively interact with retrieval models to answer complex queries. Despite substantial progress, optimizing retrievers for agentic search remains challenging, often requiring heavy co-training or gold-standard annotations that limit real-world applicability. We propose Critic-R, a framework that explicitly closes the feedback loop between the reasoning agent and the retrieval model during both inference and training. Critic-R introduces a critic model that evaluates the agent’s introspective reasoning trace after consuming retrieved evidence to determine whether the retrieved context sufficiently supports the next reasoning step. Critic-R has two complementary mechanisms: Critic-R-Zero, an inference-time query refinement loop that iteratively rewrites queries and retrieval instructions, and Critic-Embed, an optimization approach for retrieval models that leverages successful and failed refinement trajectories as automatic supervision without requiring manual relevance annotation. We evaluate Critic-R on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle. Results show that Critic-R significantly improves both retrieval quality and downstream answer accuracy.
[IR-19] rustworthy Recommendation in the Era of Large Language Models : Opportunities and Challenges
链接: https://arxiv.org/abs/2606.00540
作者: Bohao Wang,Yu Cui,Zhenxiang Xu,Jujia Zhao,Chenxiao Fan,Jizhi Zhang,Weiqin Yang,Shengjia Zhang,Sirui Chen,Yang Zhang,Xiaoyan Zhao,Wenjie Wang,Chongming Gao,Fuli Feng,Xiangnan He,Jiawei Chen
类目: Information Retrieval (cs.IR)
备注:
Abstract:The field of recommender systems (RS) is currently undergoing two profound paradigm shifts. From the perspective of objectives, the goal has shifted beyond mere recommendation accuracy to comprehensive trustworthiness, encompassing multiple dimensions such as robustness, fairness, and privacy preservation. From a technical perspective, Large Language Models (LLMs) have been extensively integrated into RS, reshaping the foundations of recommendation through richer semantic understanding, stronger intent reasoning, and more flexible user interactions. The convergence of these two shifts prompts a timely and pivotal question: how does the integration of LLMs reshape the landscape of trustworthy recommendation? In this work, we present a systematic review of trustworthy LLM-empowered recommendation. By comprehensively analyzing over 200 recent studies, we reveal that the introduction of LLMs acts as a double-edged sword. While their advanced mechanisms and user-friendly interfaces offer unprecedented opportunities to enhance trustworthiness, they simultaneously introduce new risks, such as novel forms of bias and hallucination-induced issues. To characterize this dual impact, we systematically identify 13 opportunities and 18 challenges across six fundamental dimensions of trustworthiness, and accordingly organize the existing literature into a novel taxonomy. We also provide a comprehensive review of commonly used datasets and evaluation metrics to facilitate empirical validation. Finally, we identify critical open challenges and outline future directions, hoping to inspire future research on this emerging topic.
[IR-20] UniPinRec: Unifying Generative Retrieval and Ranking at Pinterest Scale
链接: https://arxiv.org/abs/2606.00422
作者: Hanyu Li,Yi-Ping Hsu,Aditya Mantha,Prabhat Agarwal,Laksh Bhasin,Jialu Wang,Hongtao Lin,Bella Huang,Yaxin Li,Xinyi Li,Chuxi Wang,Kousik Rajesh,Hooshmand Shokri Razaghi,Shunyao Li,Zongyue Qin,Jaewon Yang,James Li,Dhruvil Deven Badani,Jiajing Xu,Charles Rosenberg
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Modern recommendation systems predominantly train retrieval and ranking as separate models despite both increasingly relying on large transformers encoding the same user behavior data, duplicating parameters, compute, and serving cost. Prior work unifies the model architecture but not the full pipeline: input formats, training procedures, and serving stacks remain fragmented across stages. We present UniPinRec, which achieves full-stack unification of retrieval and ranking at Pinterest: one input format, one model, one training stage, deployed within existing serving infrastructure. A shared transformer encodes the user action sequence into candidate-independent representations that branch into retrieval (ANN dot-product) and ranking (cross-attention) via task-specific heads. Three ideas make this work: (1) Masked Action Modeling (MAM) eliminates interleaving, enabling weight sharing without doubling context length; (2) Blended training examples pair action sequences with feedview impression slates to satisfy both objectives jointly; (3) Cross-stage KV cache sharing reuses user-history computation from retrieval for ranking, reducing total FLOPs versus serving two independent models. Deployed in the Pinterest core surfaces, UniPinRec delivers approximately +1% online engagement lift while cutting end-to-end serving latency by 11.1% and lifting QPS by 63.6%. To our knowledge, this is the first full-stack unification of retrieval and ranking, covering inputs, model, training and serving, deployed in a production recommendation system.
[IR-21] Masking Stale Observations Helps Search Agents – Until It Doesnt: A Regime Map and Its Mechanism
链接: https://arxiv.org/abs/2606.00408
作者: Haoxiang Zhang,Qixin Xu,Zhuofeng Li,Lei Zhang,Pengcheng Jiang,Yu Zhang,Julian McAuley
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: 47 pages, 7 figures
Abstract:Long-horizon search agents accumulate large amounts of retrieved content across many tool calls, making context-budget efficiency increasingly important. A minimal intervention is to mask stale observations from the context as the trajectory progresses, but it remains unclear when this form of context management helps and why. We study observation masking through a systematic sweep over various agent backbones (4B to 284B parameters) and three retrievers on offline and live-web agentic search benchmarks. We find that the accuracy gain from masking follows an asymmetric inverted-U shape when plotted against the model’s accuracy without context management: a plateau under weak retrievers, a peak when a strong retriever meets a mid-capacity model, and a sharp collapse when the model is saturated. This pattern reflects the interaction between retriever recall and the model’s implicit filtering capacity, rather than either factor in isolation. Mechanistically, masking implements a token-for-turn trade-off: it removes observations the model has largely stopped attending to and pages the agent rarely re-opens. The added turns help when they convert failures into successes, but they fail when masking removes evidence the model would otherwise have used. We therefore reframe context management as a regime-dependent intervention and provide a holistic perspective for analyzing context use in agentic deep search. We release our scaffold and trajectories here (this https URL) to support future research.
[IR-22] LLM s Need Encoders for Semantic IDs Too
链接: https://arxiv.org/abs/2606.00324
作者: Xiangyi Chen,Zelun Wang,Xinyi Li,Yi-Ping Hsu,Jaewon Yang,Jiajing Xu
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal LLMs use dedicated encoders to bridge non-language modalities (vision encoders for images, depth models for audio codec tokens) because raw token embeddings alone cannot capture modality-specific structure. We argue that Semantic IDs (SIDs), the hierarchical codes used in generative recommendation, constitute another such modality: a SID level token’s meaning depends on its prefix context, yet current systems simply add SID tokens to the vocabulary and rely on training to learn these context-dependent meanings from scratch. We propose PrefixMem, a lightweight SID encoder based on prefix n-gram memory tables that provides the LLM with structured, prefix-conditioned representations at SID token positions. Like vision encoders in multimodal LLMs, PrefixMem can be pre-trained independently and then attached to any LLM for joint training. We evaluate on large-scale data from Pinterest across multiple LLM families and show that PrefixMem improves deepest-level SID accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative at matched training compute. The encoder’s benefit concentrates on hard examples where greedy decoding fails, with up to 77% relative accuracy gains, confirming that SID tokens benefit from a dedicated encoder just as other non-language modalities do.
[IR-23] Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems
链接: https://arxiv.org/abs/2606.00282
作者: Xiangyu Wang,Yawen He,Shivendra Pratap Singh,Han Huang,Mengtong Hu,Sharath Ciddu,Yi-Hsuan Hsieh,Erik Groving,Yi Ding,Jieming Di,Tony Wang,Min Yun,Xiaoyu Chen,Ling Leng,Rob Malkin
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 13 pages, 3 figures
Abstract:Large-scale recommendation systems operate across diverse domains, yet they face the challenges of data sparsity and noisy implicit feedback. Traditional approaches mitigate this via model-specific knowledge distillation from source domains to a target domain. Inspired by the transformative success of synthetic data generation in large language models (LLMs), we introduce Synthetic Cross-domain Augmentation and Learning for Recommendation (SCALR), a framework that generates synthetic user-item interaction events for a target recommendation domain by leveraging observed events from a source domain. SCALR decomposes cross-domain learning into two modular stages. First, it translates observed user events in source domains by framing event generation as estimating the likelihood that a user would interact with a target-domain item, conditioned on their observed interactions in a source domain. Second, downstream models train on these synthetic events as cross-domain learning objectives, where the synthetic events augment the target domain’s training data in a model-agnostic manner. Our approach yields statistically significant improvements in online A/B tests on an industrial recommendation platform. To the best of our knowledge, this is among the first works to explicitly frame cross-domain event transfer as synthetic data generation for recommendation systems.
[IR-24] Multimodal Music Recommendation System using LLM s
链接: https://arxiv.org/abs/2606.00125
作者: Srikar Prabhas Kandagatla,Sreehitha R. Narayana,Chandana Magapu,Swetha Mohan,Shamanth Kuthpadi,Hongjie Chen,Ryan A. Rossi,Franck Dernoncourt,Nesreen Ahmed
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation, and while some methods partially combine semantic, acoustic, or engagement signals, none jointly model all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata using the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework by extending it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone option with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines up to 95% in terms of Recall and 79% in terms of NDCG. Moreover, our experiments show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
[IR-25] SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector
链接: https://arxiv.org/abs/2606.00084
作者: Dineth Jayakody,Pasindu Thenahandi,Sampath Jayarathna
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experiences at scale. However, transforming unstructured textual feedback into structured, actionable insights remains a challenging task. This paper presents SentimentLens, a scalable analysis system based on Aspect-Based Sentiment Analysis that performs knowledge extraction from unstructured hotel reviews and organizes them into interpretable service categories. SentimentLens integrates aspect term extraction, aspect sentiment classification, semantic category assignment, and multi-level analytical modules to support region-level, hotel-level, and category-level evaluation. The system is designed to operate across different geographic contexts and hospitality settings. To demonstrate its practical utility, we apply SentimentLens to a large real-world dataset of over 10,000 publicly available hotel reviews. Through extensive analysis, the framework reveals how traveler sentiment varies across regions, service categories, and hotel archetypes. We further implement a cross-modal reconciliation of textual sentiment and numerical ratings to identify latent operational conflicts, structural inconsistencies in service quality, and high-impact improvement opportunities using importance–performance and entropy-based analyses. The results show that SentimentLens effectively transforms large-scale unstructured reviews into actionable intelligence, supporting data-driven decision-making for hospitality management and tourism policy. While demonstrated using a national case study, the proposed system is generalizable to other destinations and review-driven service domains.
[IR-26] Beyond Text and Tables: Vision-Language Model Integration in ComProScanner for Extracting Materials Data from Scientific Figures with High Accuracy
链接: https://arxiv.org/abs/2606.00065
作者: Aritra Roy,Enrico Grisan,Chiara Gattinoni,John Buckeridge
类目: Information Retrieval (cs.IR); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 18 pages, 3 figures
Abstract:Automated extraction of materials composition-property data from scientific literature has advanced considerably with the development of large language model-based pipelines; however, existing frameworks remain limited to textual and tabular content, overlooking the substantial proportion of quantitative property data reported exclusively in scientific figures. Here, we extend ComProScanner, a fully end-to-end multi-agent framework for automated composition-property database construction, with a native vision-language model (VLM) based figure extraction capability. The extension introduces a FigureExtractor utility for caption-keyword-based figure filtering across all supported publishers, and a GraphExtractorTool agent that passes extracted figures to a configurable VLM to recover composition-property pairs from scientific charts and plots. Four VLMs are selected for evaluation on the basis of the LMArena Diagram leaderboard with an input cost criterion of less than \ 1.50 per million tokens. Benchmarking on 50 piezoelectric ceramic articles from the established d_33 test corpus demonstrates that Gemini-3-Flash-Preview achieves the highest performance with a composition accuracy of 0.97 and a normalised F1 score of 0.97, whilst remaining the most cost-effective model among the four evaluated. We additionally introduce a range-based value error threshold parameter into the evaluation framework, providing a more physically meaningful assessment of numeric property values extracted from figures than exact value matching. These contributions establish VLM-integrated ComProScanner as the first materials-specific, fully automated, multimodal literature mining platform capable of extracting structured composition-property data from text, tables, and figures within a single unified pipeline.
[IR-27] Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs
链接: https://arxiv.org/abs/2606.00050
作者: Gregory Magarshak
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB); Information Retrieval (cs.IR)
备注: 6 pages; second in a series with the Magarshak Machine / SPACER paper and the Context paper
Abstract:We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension cost at every query, Grokers pushes intelligence to write time: autonomous Groker agents analyze nodes in a typed stream graph, extract structured attributes via governed language model (LM) calls, and inductively compose that understanding upward through dependency relations, writing enriched typed attributes that serve all future queries at zero additional LM cost. We prove three formal properties: (1) the Byte-Identity Theorem, establishing that context blocks assembled from a transactionally-maintained denormalization index are byte-identical across LM turns between semantic changes, enabling KV-cache hit rates approaching 100%; (2) the Accumulation Monotonicity Theorem, establishing that the fraction of interactions resolved without LM calls is non-decreasing in the number of completed interactions under a governed wisdom library growth protocol; and (3) the Dual-Traversal Ordering Theorem, establishing that top-down generation and bottom-up comprehension are the unique correct traversal orderings for their respective tasks over a dependency DAG, and that their composition closes into a complete generation-comprehension cycle. We further present a deterministic alternative to embedding-based semantic search, with a synonym caching protocol whose LM fallback rate converges to zero for finite-vocabulary domains. A reference implementation is provided in the open-source Qbix / Safebox / Safebots stack.
[IR-28] Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel
链接: https://arxiv.org/abs/2606.02156
作者: Zahra Tabatabaei,Jon Sporring,Mark Bremholm Ellebæk,Alaa El-Hussuna
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative assessment relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol paper requires close collaboration between hospitals and universities; this protocol demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.
人机交互
[HC-0] Fostering Emotional Perspective-Taking: An Exploration of Affective Face-Tracking Interactions in the VR Narrative Rekindle
链接: https://arxiv.org/abs/2606.02425
作者: Hector Fan,Casper Hartveld,Mark Sivak
类目: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: 5 pages, 5 figures. Interactivity paper accepted to DIS Companion '26 (Designing Interactive Systems Conference), Singapore, June 2026
Abstract:Interest in leveraging emotions in Interactive Digital Narrative (IDN) has been growing, and Virtual Reality (VR) offers rich access to real-time biometric data such as facial expressions; yet this capability remains underexplored in novel IDN design. Existing approaches typically treat emotion input superficially, such as adjusting system difficulty or aesthetics, but rarely influence how players experience the narrative itself. Prior work also lacks a focus on a specific authored narrative. We propose an experimental affective interaction model that uses a VR headset’s built-in face-tracking capability to recognize player emotional states, fostering “emotional perspective-taking” between the player and their embodied story character, thereby deepening the player’s emotional connection to the character and their narrative engagement with the VR experience.
[HC-1] Attention Dynamics and Adaptive Decision Support in C5ISR: A Recurrence Quantification Analysis of Visual and Multimodal Attention Guidance Effects on Mission Performance
链接: https://arxiv.org/abs/2606.02382
作者: Hyun-Gee Jei,Caleb J. Armstrong,Farzan Sasangohar
类目: Computational Complexity (cs.CC); Human-Computer Interaction (cs.HC); Computation (stat.CO)
备注: 11 Figures, 3 Tables
Abstract:Modern command, control, communications, computers, cyber, intelligence, surveillance, and reconnaissance (C5ISR) environments place substantial attentional demands on mission commanders. Failures in attention allocation in these high-risk settings can have severe operational consequences. This study investigates the efficacy of gaze-driven, attention-guided adaptive decision support tools, including visual-only and multimodal designs, in a high-fidelity simulated military command center. To characterize gaze and attentional dynamics during interaction with these tools, recurrence quantification analysis was applied to eye-tracking data. Stepwise regression using the Bayesian information criterion was then used to identify recurrence-based gaze metrics associated with performance. Results showed that the multimodal adaptive decision support tool was associated with significantly higher performance than the visual-only attention-guided tool. Average diagonal line length showed a negative linear association with performance, whereas entropy showed a positive linear association. Recurrence rate, determinism, and entropy also showed nonlinear quadratic relationships with performance. In particular, recurrence rate and determinism followed an inverted-U pattern consistent with the Yerkes-Dodson law. These findings suggest that effective performance in dynamic C5ISR contexts depends on a balance between structured and flexible visual scanning, and that recurrence-based gaze metrics can help characterize attentional dynamics during interaction with adaptive decision support systems.
[HC-2] WAXAL-NET: Finetuned Edge ASR Across 19 African Languages
链接: https://arxiv.org/abs/2606.02375
作者: Victor Tolulope Olufemi,Oreoluwa Babatunde,Ramsey Njema,Bolarinwa Gbotemi,Wanchi Lucia Yen,John Uzodinma,Sunday Ajayi,Oluwademilade Williams,Kausar Moshood,Innocent Elendu Anyaele,Akebert Arefaine,Candace Hunzwi,Wongel Dawit Daniel,Emmilly Namuganga,Cleophas Kadima,Athanase Bahizire,Onitsiky Ranaivoson,Emmanuel Aaron,Nicholaus Ladislaus,Idris Muhammed,Jonathan Enoch Simenya,Martin Koome,Matewos Tegete Endaylalu,Peter Ifeoluwa Adeyemo,Hondi Prisca Birindwa,Ukachi Agnes Eze-Mbey,Yacoba Oduro-Yeboah,Pericles Adjovi,Mikel K. Ngueajio,Toluwani Aremu,Prasenjit Mitra
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注:
Abstract:We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of 38.0% compared to 64.9% for the best zero-shot baseline, a 26.9 percentage-point reduction using models 3-40\times smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all 19 languages.
[HC-3] Quantitative Movement Testing: Measuring Patient Movements from a Single Smartphone Video
链接: https://arxiv.org/abs/2606.02301
作者: Pranav Mahajan,Amanda Wall,Eleonora Maria Camerone,Julie Stebbins,Eoin Kelleher,Shuangyi Tong,Annina Schmid,Katja Wiech,Anushka Irani,Ben Seymour
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Chronic pain diminishes quality of life by decreasing functional ability, yet objectively measuring this functional impact remains challenging in real-world settings. While optical motion capture provides high precision for assessing altered movement quality, it is costly and restricted to laboratory environments. We aimed to develop and validate Quantitative Movement Testing (QMT), a computer vision pipeline extracting 3D kinematic biomarkers from standard monocular smartphone video, balancing clinical accessibility with biomechanical accuracy. We validated the QMT pipeline, utilising deep learning-based 3D pose-estimation, against gold-standard optical motion capture in healthy controls (N=13). Following leave-one-subject-out calibration to correct systematic bias, we deployed QMT in two prospective clinical cohorts to assess real-world utility: a pre- and post-intervention trial for fibromyalgia patients, and a 30-day longitudinal at-home monitoring study of chronic sciatica patients and healthy controls. In laboratory validation, QMT extracted clinical kinematic metrics with high agreement to optical motion capture, yielding strong correlations (r 0.85) and low mean absolute errors. QMT demonstrated high test-retest reliability (r 0.86) in fibromyalgia patients and successfully tracked day-to-day movement fluctuations in chronic sciatica. While real-world home settings introduced higher measurement variance than lab settings, QMT found group-level differences between healthy controls and sciatica patients based entirely on remote recordings. Monocular 3D pose estimation offers a scalable alternative to traditional assessments. QMT provides an objective, accessible biomarker for tracking disease progression and treatment response in clinical trials, though further research is needed to optimise reliability in home environments.
[HC-4] Guided Sensemaking: Agents in Collaborative Deliberation
链接: https://arxiv.org/abs/2606.02260
作者: Aaditya Bhatia,Navdeep Kaur Bhatia,Marc-Antoine Parent,Jack Park
类目: Human-Computer Interaction (cs.HC)
备注: Presented at Tools for Thought (TfT) workshop at CHI 2026
Abstract:Generative AI systems are aggressively reshaping how students engage with information and perform cognitive work; convenience-oriented use has the potential to displace effortful reasoning, reflection, and learning, especially for those who lack domain expertise and effective human-AI interaction strategies. Current AI tools are heavily focused on chat-style interfaces geared towards answer generation and efficiency in a linear and fragmented stream of text, offering limited support for structured reflection, argument construction, and sensemaking in collaborative contexts. We introduce Guided Sensemaking, an AI-augmented multiagent discourse platform that facilitates composition of well-thought-out ideas around a central question, provides scaffolding for critical thinking, and enables visualization of argumentative structure to support critical thinking and collaborative deliberation. The system uses several interactive agents to provide context-sensitive questioning prompts and a scaffolding for thought that exposes thematic clusters, agreements, and points of contention without collapsing diverse perspectives. This paper proposes a conceptual design and interaction paradigm that positions generative AI not as a shortcut to answers but as a research partner that externalizes reasoning, preserves user agency, and fosters structured, traceable sensemaking in educational and civic contexts.
[HC-5] Context-Aware Workflow Decomposition for Automated Mobile UI Annotation Using Multimodal Large Language Models
链接: https://arxiv.org/abs/2606.02208
作者: Athar Parvez,Muhammad Jawad Mufti,Muqaddas Gull,Omar Hammad
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Accurate mobile user interface annotation is important for UI understanding, accessibility tools, automated testing, dataset construction, and GUI agents. However, mobile screens are difficult to annotate because they often contain small, dense, nested, and visually ambiguous elements. Multimodal large language models can help automate this process, but their outputs are sensitive to prompt design and the organization of annotation tasks. This paper studies automated mobile UI annotation from a workflow design perspective, focusing on improving annotation precision. Rather than asking the model to annotate all UI elements in a single step, the task is divided into smaller context-aware stages, allowing related UI elements to be handled with clearer instructions and useful screen context. The proposed pipeline uses structured prompts, schema-constrained JSON outputs, and element-specific annotation instructions. Experiments are conducted on expert-annotated mobile UI screens from the MUIAnno dataset, using eight common UI element types: button, tab, clickable text, card, label, plain text, icon, and image. Four workflow strategies are evaluated: one-step, two-step, four-step, and eight-step annotation. Results show that the two-step workflow achieves the highest precision, while deeper decomposition improves recall but produces more false positives. Additional grouping experiments show that annotation quality depends on both workflow depth and element-class grouping. Overall, careful workflow design can make LLM-based mobile UI annotation more reliable for UI understanding, dataset construction, and GUI agent development.
[HC-6] Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment
链接: https://arxiv.org/abs/2606.02082
作者: Xiyang Huang,Renxiong Wei,Yihuai Xu,Zhiyuan Chen,Keying Wu,Jiayi Xiang,Buzhou Tang,Yanqing Ye,Jinyu Chen,Cheng Zeng,Min Peng,Qianqian Xie,Sophia Ananiadou
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
[HC-7] Respectful Things: Adding Social Intelligence to Smart Devices
链接: https://arxiv.org/abs/2606.02037
作者: Max Van Kleek,William Seymour,Reuben Binns,Nigel Shadbolt
类目: Human-Computer Interaction (cs.HC)
备注: In Proceedings of the 2018 Living in the Internet of Things: Cybersecurity of the IoT Conference
Abstract:In this paper, we propose that the idea of devices respecting their end-users may serve as a strong design goal for highly personal and intimate smart devices. We ask what respect is, how it shapes interaction, and how good-faith simulation of respect might inform user-friendly smart device design. Respect is a natural and integral part of natural human relationships that is seen to shape work and personal relations. In a basic sense, this is the core purpose of smart things: we expect them to be ready and willing to help us. In this vein, we distil the characteristics of more complex respectful behaviours into 4 main types relevant to smart devices, drawing from philosophical analyses of the conceptual dimensions of respect: directive respect, obstacle respect, recognition respect, and care respect. We discuss the implications of each of these kinds of respect for the future of smart personal devices.
[HC-8] AutoBG: A Board Game Design Assistant with Interactive Ideation Iterative Rulebook Generation and Individualized Feedback
链接: https://arxiv.org/abs/2606.01976
作者: Zizhen Li,Chuanhao Li,Yibin Wang,Jianwen Sun,Yukang Feng,Fanrui Zhang,Mingzhu Sun,Yifei Huang,Kaipeng Zhang
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Designing a board game demands both thinking as a designer and experiencing as a player, while iterating through repeated prototyping and playtesting cycles, making it a cognitively intensive creative task well suited for human-AI collaboration. However, current systems lack end-to-end support to guide designers through the complete workflow from vague early ideation to iterative rulebook revision and audience testing. To this end, we present AutoBG, a board game design assistant built around critic-driven iterative refinement, comprising four specialized modules: BG-Ideator guides designers via multi-turn dialogue to produce structured design drafts; BG-Realizer generates complete rulebooks from drafts and revises them in a closed loop with BG-Critic, which diagnoses design flaws and gates each revision so that only verified improvements are accepted; and BG-Persona simulates individualized feedback from 150 real player profiles. Together, these modules enable designers to go from an initial idea to a polished, audience-tested rulebook within a single integrated workflow. The system is built on 2.2K structured rulebooks and 180K quality-filtered real player reviews, with task-specific training data derived for each module. Experiments on 207 held-out games show that AutoBG substantially outperforms state-of-the-art baselines (e.g., GPT-5.4), generating rulebooks that approach the quality of published games. Furthermore, a user study with 30 participants across diverse experience levels confirms that AutoBG effectively reduces blank-page anxiety, surfaces hidden design flaws, and provides highly rated, practical assistance throughout the creative process.
[HC-9] rust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM -Generated Multi-File Changes
链接: https://arxiv.org/abs/2606.01969
作者: Lo Gullstrand Heander,Agnia Sergeyuk,Ilya Zakharov,Emma Söderberg,Nikita Mukhortov
类目: oftware Engineering (cs.SE); Human-Computer Interaction (cs.HC)
备注: Submitted to ESEM SEIP 2026
Abstract:Background: Developers increasingly review multi-file code changes generated by LLM-based agents, yet no validated end-to-end workflow or IDE tooling design exists for this scenario. Aims: We investigate (RQ1) the challenges developers face when reviewing LLM-generated multi-file changes and (RQ2) how developers envision effective workflows for this task. Method: In collaboration with JetBrains, we conducted a participatory design study structured using the double-diamond design process with Discover, Define, Develop, and Deliver phases. Industry practitioners participated in the Discover phase (N=17); seven of these returned for the Develop phase. The Define phase was an author-led synthesis. The Deliver phase produced a conceptual design and a high-fidelity semi-interactive prototype evaluated through a follow-up survey with N=43 practitioners. Results: Participants identified trust-calibration as the central challenge. The study yielded a three-level review workflow (overview, file-analysis, code snippet review) supported by seven design constructs (chunk, risk-per-line, risk-per-file, judge, walk-through, zooming in/out, and security cage). In the validation survey, all three workflow levels scored above the neutral midpoint (means 3.50–3.91 on a five-point scale). Of the respondents, 63% expected reduced overall review effort, and 52% reduced trust-assessment effort, relative to their current tools. These findings suggest that the design constructs indicate a positive direction for future tool development. Conclusions: Reviewing LLM-generated multi-file changes is a trust-calibration problem rather than a diffing problem. The three-level workflow and the seven constructs we report give tool designers a conceptual framework for building AI-ready code review tools that surface risk and confidence signals at the granularity at which developers allocate attention. Comments: Submitted to ESEM SEIP 2026 Subjects: Software Engineering (cs.SE); Human-Computer Interaction (cs.HC) ACMclasses: H.5.2; D.2.9; D.2.6 Cite as: arXiv:2606.01969 [cs.SE] (or arXiv:2606.01969v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2606.01969 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lo Gullstrand Heander [view email] [v1] Mon, 1 Jun 2026 09:32:25 UTC (4,845 KB) Full-text links: Access Paper: View a PDF of the paper titled Trust-Calibrated Code Review: A Participatory Design Study of Review Workflows for LLM-Generated Multi-File Changes, by Lo Gullstrand Heander and 4 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.SE prev | next new | recent | 2026-06 Change to browse by: cs cs.HC References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[HC-10] A Minimalist Brain-Computer Musical Interface for Real-Time Emotion-Driven Sonification: System Design and Preliminary Evaluation
链接: https://arxiv.org/abs/2606.01473
作者: Pablo A. Monroy-D’Croz,Rafael Ramirez-Melendez,Julian Cespedes-Guevara
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:This paper presents a minimalist brain-computer Musical Interface (BCMI) that functions as a real-time affective sonification system, translating prefrontal EEG activity into adaptive music. Emotional valence is estimated from frontal alpha asymmetry (AF7/AF8) and mapped to musical features such as mode, tempo, rhythmic density, and pitch register through a stochastic generative algorithm. The system integrates wireless EEG acquisition, real-time Python signal processing, and Ableton Live-based music generation synchronized via Lab Streaming Layer. An experiment with 22 participants investigated whether intentional emotional self-induction could modulate the BCMI neurofeedback signal. Linear mixed-effects analyses found no significant effects of target emotion or time, indicating that the frontal alpha asymmetry signal did not reliably distinguish instructed emotional states. Individual differences, including musical training and acting experience, explained more variance than the experimental manipulation, which accounted for only 0.40% of total signal variance. These findings highlight the challenges of using frontal alpha asymmetry as a voluntary control signal for closed-loop emotion regulation and suggest methodological directions for future BCMI research.
[HC-11] What LLM s Must Forget to Teach Effectively: A DIY Approach to Premodern Japanese Language Pedagogy
链接: https://arxiv.org/abs/2606.01410
作者: Ariel Stilerman,Andrew Nelson,Alan Cheng,Caleb Langley,Sera Wang,Camilla Piana,Pelin Çılgın,Qianhe Qin,Teisha Nishimitsu,Liaoliao Zhang,Huiting Liu,Josh Eyre,Gavin Sherry
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:We discuss a novel approach to Premodern Japanese Language Pedagogy (PJLP) with potential applications in other languages and fields. The integration of artificial intelligence into education has largely operated as a top-down project, affording minimal agency to everyday users. This dynamic mirrors the broader frontier model ecosystem, which concentrates massive human and financial resources within a few labs. Drawing inspiration from grassroots initiatives such as the DIY and Maker movements, this paper advocates for an approach to AI in Education that fosters instructional and student agency over the pedagogical process. Specifically, we discuss a tutoring framework for textual analysis in the context of a graduate seminar in premodern Japanese literature, as well as a bilingual interactive dictionary and a conversational partner created for a language course in Classical Japanese. Created through prompt engineering as custom instances of a Large Language Model (LLM), these three tools are designed to counteract the tendency of out-of-the-box LLMs to either bypass student effort through over-explanation or misguide learners via hallucinations. To illustrate how this approach can promote active comprehension and pedagogical alignment, we provide transcripts (logs) of actual exchanges, sample instructions (system prompts), and guidance for instructors curious about exploring this approach in a variety of fields (starter kit).
[HC-12] Institutional Trust and the Domestic AI Advantage: Evidence from DeepSeek and ChatGPT Users in China
链接: https://arxiv.org/abs/2606.01228
作者: Jiashen Huang,Yu Jia,Xu Pan
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
备注: 48 pages
Abstract:Public trust in generative artificial intelligence exhibits increasingly divergent patterns across national contexts, yet prevailing research largely overlooks the macro-structural forces underlying this divergence. This study argues that trust in AI is not merely a technical response to performance but a product of institutional refraction. We propose an ``Institutional Prism’’ framework to demonstrate how institutional trust shapes user trust in domestic (DeepSeek) and global (ChatGPT) large language models. Drawing on Cognitive-Affective Trust Theory, we distinguish between cognitive and affective dimensions of trust and analyze survey data from 405 Chinese users. The findings show that higher institutional trust is positively associated with stronger affective trust in domestic AI models and shifts cognitive evaluations in a more favorable direction. While under lower institutional trust, this domestic advantage weakens. These findings reveal that institutional trust has emerged as a core dimension of AI trust formation. By linking micro-level psychological judgments with macro-level governance, this research contributes a new perspective to human-machine communication.
[HC-13] From Craft Practice to Aesthetic Cognition Transmission: Workflow Cognition Translation for AI-native Intangible Cultural Heritage Education
链接: https://arxiv.org/abs/2606.01203
作者: Annie Yuan
类目: Human-Computer Interaction (cs.HC)
备注: 22 pages, 7 figures
Abstract:Intangible Cultural Heritage (ICH) education has traditionally relied on apprenticeship, embodied participation, and long-term engagement with masters, materials, and cultural environments. While these modes of transmission remain essential, they are difficult to scale. Existing digital heritage initiatives have expanded documentation and access, but often preserve artefacts, procedures, and representations of practice rather than the aesthetic and cognitive structures through which expertise operates. This paper argues that the future challenge of ICH education is not only the transmission of craft techniques, but the scalable transmission of aesthetic cognition: the perception, judgement, interpretation, and culturally situated meaning-making through which aesthetic expertise develops. Drawing on aesthetic education, tacit knowledge, cognitive apprenticeship, and expert cognition, we propose a shift from craft transmission to Aesthetic Cognition Transmission. To support this shift, we introduce Workflow Cognition as a model of how experts coordinate perception, judgement, decision-making, and action within evolving workflows. We then propose Workflow Cognition Translation as a methodological framework for transforming expert workflow cognition into computable educational representations for AI-native learning systems. The paper makes three contributions: it reframes ICH education around aesthetic cognition transmission; introduces Workflow Cognition Translation as a method for representing expert aesthetic cognition; and outlines an AI-native cognitive apprenticeship infrastructure involving AI Expert Twins, workflow-based tutoring, and progressive learner participation. Rather than replacing masters, workshops, or embodied practice, the framework positions AI as a cognition mediation infrastructure for expanding access to heritage expertise.
[HC-14] pcbGPT : Automatic PCB Schematic Synthesis from Natural Language Requirements
链接: https://arxiv.org/abs/2606.01188
作者: Tobias King,Steven Kehrberg,Michael Beigl,Tobias Röddiger
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Translating natural-language hardware requirements into correct printed circuit board (PCB) schematics remains difficult in embedded, IoT, and wearable development. Designers must choose compatible components, interpret datasheets, add support circuitry, and expose correct interfaces before layout and prototyping can begin, while many such circuits cannot be validated through straightforward simulation. We present pcbGPT, a grounded system for generating editable KiCad schematics from natural-language specifications. pcbGPT represents circuits in a Python DSL and combines tool-augmented synthesis with component-library search, datasheet-grounded design knowledge, execution-based checking, structural and semantic validation, and an interactive web workflow that supports iterative refinement and synchronization with KiCad projects. We evaluate the system on 20 embedded schematic-generation tasks with reference implementations, required components, and interface constraints that enable automatic comparison. The best model reaches overall pass@1 of 0.90 and pass@5 of 1.00; pass@1 is 1.00 on basic and easy tasks, 0.91 on medium tasks, and 0.72 on hard tasks. These results, together with failure analysis, show that pcbGPT can already generate useful, reviewable first-draft schematics for early prototyping, but is not yet reliable enough to replace expert review.
[HC-15] Relational Intervention During Functional Collapse in Large Language Models : A Lexical-Statistical Ablation and a Structure x Register Factorial
链接: https://arxiv.org/abs/2606.00935
作者: Franco Santana,Horacio Vico
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 12 pages, 5 figures. Preprint
Abstract:We test whether a relational-style intervention delivered during functional collapse in a small language model produces post-collapse behavior distinguishable from technical feedback, from a lexically-matched scrambled control, and from each of the two pragmatic dimensions in isolation. Using Qwen3.5-4B with a deliberately broken bash tool, we run 300 episodes across six conditions in a matched-pairs design (50 tasks): no intervention (A), technical/impersonal (B), relational/first-person ©, scrambled relational (D), technical/first-person (E), and relational/impersonal (F). E and F form a 2x2 factorial with B and C that dissociates relational structure (acknowledgment, absolution, agency restoration, unconditional acceptance) from sender register (first-person vs. impersonal). We report two main findings. First, an attention-behavior dissociation: attention follows lexical surprise (D F C E B, all q_FDR 10^-10), with the scrambled message capturing the most attention; yet behaviorally A ~ B ~ D E ~ F C. Second, the factorial localizes the C effect: neither relational structure alone (F) nor first-person register alone (E) replicates C’s behavioral signature; main effects of both dimensions are individually significant, and the structure x register interaction is significant on persistence (p = 0.046). A third dissociation emerges in emotion probes: F tracks C on 7 of 8 probes despite producing only baseline behavior, indicating that relational structure alone installs a probe-level state that only translates into behavior when paired with first-person register. The model’s processing decomposes into three dissociable stages: attention (ordered by lexical surprise), probe-level state (ordered by structure), and behavior (ordered by the conjunction of both).
[HC-16] MIA: A Visual Analytics System for Multimodal Spectral Imaging Data
链接: https://arxiv.org/abs/2606.00874
作者: Hennes Rave,Katharina Kronenberg,Hannes Gödde,Lea Tobergte,Michael Holtkamp,Julia Werner,Peter Bohrer,Fabian Lohöfer,Rickmer Braren,David Clases,Uwe Karst,Lars Linsen
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Hyperspectral bioimaging techniques such as infrared (IR) microscopy and laser ablation-inductively coupled plasma-mass spectrometry (LA-ICP-MS) produce high-dimensional, spatially resolved datasets that require sophisticated analysis to reveal chemically and anatomically meaningful structures. Existing software solutions are typically modality-specific and cover only parts of the analytical workflow, forcing researchers to transfer data across multiple tools and manually reconcile results. We present MIA (Multiscale Image Analysis), a modality-agnostic visual analysis environment that integrates the full exploratory workflow – from spectral preprocessing and dimensionality reduction to interactive segmentation and spectral similarity analysis – within a single, tightly coupled interface. MIA supports hierarchical and landmark-based embeddings to handle datasets of varying scale and complexity, interactive and automatic segmentation with a shared state across all linked views, and multimodal analysis of co-registered datasets from different instruments. We demonstrate the effectiveness of MIA through three use cases drawn from real analytical chemistry workflows: (1) the recovery of biologically meaningful tissue compartments through derivative preprocessing and hierarchical embedding, (2) pigment identification via spectral similarity search with spatial overview, and (3) multimodal tissue characterization combining molecular IR and elemental LA-ICP-MS data. Qualitative feedback from domain expert collaborators confirms that MIA reduces the need for tool-switching and supports analytical insights that are difficult to obtain with existing software.
[HC-17] Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning
链接: https://arxiv.org/abs/2606.00851
作者: Sukru Samet Dindar,Riki Shimizu,Xilin Jiang,Nima Mesgarani
类目: ound (cs.SD); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:
Abstract:Empathetic spoken dialogue systems must infer a user’s emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user’s speech and, when available, explicit affect specifications provided as a continuous valence–arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
[HC-18] ErgoGlide: A Wearable Trackball Device for Ergonomic Text Entry in Virtual Reality
链接: https://arxiv.org/abs/2606.00823
作者: Muhammad Abu Bakar,Yu-Ting Tsai,Muhammad Imran,Yan-Ann Chen
类目: Human-Computer Interaction (cs.HC)
备注: 25 pages, 15 figures
Abstract:In virtual reality, it is challenging to achieve satisfactory text entry speed/accuracy, ergonomics, usability, and learnability. To address this issue, we developed ErgoGlide, a novel lightweight and compact wearable device that facilitates text entry tasks in virtual environments. The proposed ErgoGlide can be regarded as a small trackball that is wearable on a user’s finger like a ring. By using ErgoGlide with a hive-like virtual keyboard, the user can rotate the ball for key selections, making text entry intuitive and accurate. We conducted three user studies to evaluate ErgoGlide and found that key confirmation techniques have significant effects on text entry speed and the hive-like keyboard design significantly reduced thumb movements. Furthermore, ErgoGlide can significantly improve typing accuracy, ergonomics, and usability over previous text entry methods. Experimental results also indicated that the typing speed of ErgoGlide can be notably improved after training.
[HC-19] Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems
链接: https://arxiv.org/abs/2606.00807
作者: Nicholas Davis
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Traditional artificial intelligence has largely conceptualized intelligence as isolated computation occurring within bounded agents. Across classical AI, machine learning, and many generative systems, the dominant unit of analysis remains the individual model or autonomous system evaluated through outputs, benchmarks, prediction accuracy, or optimization performance. While these approaches have produced major advances, they often under-theorize the role of interaction in the emergence of intelligence, creativity, meaning, and adaptive behavior. This paper proposes interaction as the primary unit of analysis for co-creative AI and interaction-centered intelligence more broadly. Drawing from distributed cognition, embodied cognition, enaction, participatory sense-making, human-computer interaction, and computational creativity, the paper traces a historical progression toward increasingly relational accounts of intelligence. Building upon prior work in Creative Sense-Making, quantified co-creation, and co-creative systems such as the Drawing Apprentice and AI Drawing Partner, it argues that intelligence emerges through evolving interaction dynamics among agents, environments, and socio-technical systems rather than solely through internal computation. The paper introduces Interaction-Centered Intelligence as a framework for understanding human-AI co-creation, collaborative emergence, adaptive participation, and interactional dynamics. Rather than evaluating intelligence solely through generated outputs, the framework emphasizes interaction trajectories, coordination patterns, participatory engagement, adaptive regulation, and interactional drift unfolding through time. Implications for explainable co-creative AI, hybrid intelligence, enactive AI, and future human-AI systems are discussed.
[HC-20] A multimodal dataset of photoplethysmography and continuous behavioral responses to ASMR and nature videos
链接: https://arxiv.org/abs/2606.00752
作者: Tushar Das,Daigo Hozaki,Koushlendra Kumar Singh,Hirohito M. Kondo
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Human-Computer Interaction (cs.HC)
备注:
Abstract:Autonomous Sensory Meridian Response (ASMR) is a somatosensory phenomenon characterized by pleasant tingling sensations and cardiovascular slowing. However, ASMR research has been hindered by a dearth of standardized, open-access multimodal datasets. To address this limitation, we present REST-ASMR (Response to Environmental Sensory Triggers), a synchronized multimodal dataset designed to capture behavioral reports and physiological dynamics during ASMR, with nature-relaxation videos as control stimuli. The dataset includes high-resolution photoplethysmography (PPG), time-aligned audiovisual stimuli, and continuous subjective annotations from 34 participants. Technical validation showed high stimulus efficacy (97% responder rate), significant stimulus-specific inter-subject agreement (p 0.05), and a robust PPG-derived ASMR-specific cardiovascular deceleration. Additionally, a Bidirectional Long-Short Term Memory model successfully predicted subjective ASMR tingle states, achieving video-level ASMR vs. Nature classification with perfect accuracy and a frame-level global mean accuracy of 75.51%, macro F1-score of 71.86%, and 100% Nature-baseline specificity, under a strict, leakage-free subject-video double-independent 4-fold cross-validation. REST-ASMR constitutes a dense temporal foundation for affective computing, multimodal research, and the development of personalized models of relaxation-related responses.
[HC-21] Knowing When to Move: Evidence Accumulation Models of Human Behavior in Traffic
链接: https://arxiv.org/abs/2606.00727
作者: Floor Bontje,Felix van Waveren,Leendert van Maanen,Bhargav Nallapu,Gustav Markkula,Arkady Zgonnikov
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Evidence accumulation models provide a formal framework for studying decision making as a dynamic process unfolding over time. While these models have been extensively developed and reviewed in laboratory paradigms, their structured application in complex, ecologically valid domains has received comparatively little attention. Road traffic is a particularly relevant context for studying sustained, embodied perception action behavior, where decisions unfold under time pressure and involve continuous control and ongoing perception-action coupling. Examining how EAMs have been applied in this domain may therefore offer insights beyond discrete laboratory tasks toward decision making in real-world behavior. This semi-systematic review synthesizes 28 studies (2014-2026) applying EAMs to traffic-related behavior. We organize the literature along two dimensions: 1) modelling level, distinguishing models at the level of discrete decision-making and models at the level of continuous action control, and 2) model architecture, distinguishing evidence accumulation as either a stand-alone decision model or an embedded component within broader perception-action or interaction frameworks. These distinctions are associated with systematic differences in model architecture, parameterization, data usage, and validation strategies, reflecting task specific demands. By providing a structured overview of these patterns, this review clarifies how EAMs are currently instantiated in traffic contexts and highlights methodological challenges and future directions both in traffic modelling and in modelling of decision-making more broadly. Promising directions include laboratory work on evidence accumulation in sustained and time-varying tasks, interactive multi-individual decision-making, and the use of neurophysiological measures to identify the perceptual evidence underlying complex perception-action behavior.
[HC-22] Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation
链接: https://arxiv.org/abs/2606.00629
作者: Nelly Garcia,Aditya Bhattacharjee,Gabryel Mason-Williams,Israel Mason-Williams,Emmanouil Benetos,Joshua Reiss
类目: ound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: DaFx 2026
Abstract:Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool’s workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.
[HC-23] A Four-Tier Communication Architecture and Sim-to-Real Validation of a Graphical Open-Source Platform for Robotic Engineering Education
链接: https://arxiv.org/abs/2606.00550
作者: Thien Tran,Khang Duong,Minh Tran,Jonathan Kua,Thuong Hoang,Jiong Jin
类目: Human-Computer Interaction (cs.HC); Emerging Technologies (cs.ET); Robotics (cs.RO)
备注: 4 pages, 4 figures, accepted as a Work-in-Progress (WiP) paper, on the 24th IEEE International Conference on Industrial Informatics (INDIN), 26-29 July, 2026, Melbourne, Australia
Abstract:The persistent challenge in scaling authentic manipulator education within university laboratories is a structural dichotomy: commercial digital twins are often cost-prohibitive and rigidly scripted, whereas open-source robotics middleware (ROS) imposes steep technical and syntax barriers for novices. To resolve this logistical and educational friction, this Work-in-Progress (WiP) paper proposes a scalable four-tier communication architecture tailored for sustainable robotic curricula. Rather than focusing on software application design, our study examines the underlying data exchange mechanisms required to bridge visual conceptual environments with physical robotic endpoints, utilizing the Graphical Open-Source Platform (GOSP) as a foundational instantiation. This WiP details the framework’s technical integration of 3D visual armature modeling with a robust ROS middleware backend, emphasizing the serialization, routing, and encapsulation of intricate communication routines. Preliminary sim-to-real validation using multi-axis spatial trajectories confirms that encapsulating these communication pipelines provides a sufficient fidelity hardware-agnostic pathway. By bridging virtual design and physical execution, this architectural blueprint offers a viable infrastructure for engineering education.
[HC-24] CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space
链接: https://arxiv.org/abs/2606.00472
作者: Hung Q. Vo,Huy Q. Vo,Son T. Ly,Zhihao Wan,Anh-Vu Nguyen,Hong Zhao,Jianting Sheng,Stephen T. C. Wong,Hien V. Nguyen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.
[HC-25] Literary Emotions in Motion: A Soft Robotics Installation for Tactile Storytelling
链接: https://arxiv.org/abs/2606.00418
作者: Carolina Silva-Plata,Abraham Villavicencio-Carmona,Miguel Silva Plata,Stefan Escaida,Ruben Fernandez
类目: Robotics (cs.RO); Human-Computer Interaction (cs.HC)
备注: 8 pages, 8 figures
Abstract:Soft robotics is increasingly explored in artistic contexts, where tactile interaction provides audiences with embodied engagement beyond visual or auditory signals. This work presents an interactive installation that maps semantic emotion analysis of narrative text into variable stiffness of soft pneumatic modules. A natural language model identifies two dominant emotions from a predefined set of six, driving the inflation of seven hexagonally arranged soft actuators. The central actuator represents the primary emotion, while the surrounding ones express the secondary. We develop and mechanically characterize silicone actuators, called soft modules, featuring a thin membrane layer, demonstrating how this morphological control expands the achievable stiffness range while preserving simplicity and low-cost fabrication. A user study with ten participants further evaluates how multisensory coupling of stiffness and LEDs intensity influences emotional perception. The results suggest that stiffness modulation accompanied by color change can support emotionally meaningful and engaging tactile interaction in soft robotic installations.
[HC-26] Agent ic Authoring of Interactive Multiview Visualizations in Genomics
链接: https://arxiv.org/abs/2606.00370
作者: Astrid van den Brandt,Kiroong Choe,Sehi L’Yi,Devin Lange,Nils Gehlenborg
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: 11 pages, 12 figures
Abstract:Diverse genomics data, scientific questions, and analysis tasks typically demand highly specialized visualizations. Therefore, users often must customize or author new ones tailored to their data. Existing tools are usually either limited in customization or require substantial learning or programming, and even expressive tools assume visualization expertise many users lack. Agentic and large language model (LLM) approaches are increasingly applied to complex scientific tasks, including visualization. Natural-language conversational interfaces offer a promising path to democratizing the authoring of complex visualizations. In the context of genomics, these approaches face additional challenges: genomics visualizations typically integrate heterogeneous data types and are composed of multiple linked interactive views. These challenges motivate more structured LLM-based schemes. We first characterize where vanilla LLM generation succeeds and fails for genomics visualization, identifying eight quality dimensions. We then compare six schemes–direct generation, a fixed pipeline, and four agentic configurations varying in the number of specialist agents and the presence of a reviewer–across 159 cases spanning three levels of query ambiguity and specification complexity. All schemes use the Gosling visualization grammar as structured output. Agentic iteration substantially improves perceived quality over both baselines, while more complex agent architectures yield no additional benefit. We discuss implications for designing agentic systems for domain-specific visualization authoring. All supplemental materials are available at this https URL.
[HC-27] Effects of Varying LLM Access on Essay Writing Behavior
链接: https://arxiv.org/abs/2606.00250
作者: Julia Christenson,Karin de Langis,Shirley Anugrah Hayati,Dongyeop Kang
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: BEA (Building Educational Applications) Workshop 2026
Abstract:Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.
[HC-28] he New Social Image: How AI Competency and AI Proactivity Influence Self- and Peer-Perceptions in the Workplace
链接: https://arxiv.org/abs/2606.00182
作者: Kuntal Ghosh,Marc Hassenzahl,Shadan Sadeghian
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted for publication in Interacting with Computers (Oxford University Press)
Abstract:Human-AI collaboration is considered the most promising way to incorporate AI in the workplace. What remains unexplored are the experiential consequences of this teaming. More specifically, in a team with AI, how humans perceive themselves (self-perception) and how they are perceived by their coworkers (peer perception) in terms of work ownership and job meaningfulness. In a 2x2x2 vignette study (n=50), participants rated perceptions of ownership, affect, job meaningfulness and satisfaction, and role dynamics across two levels (low/high) of AI proactivity and AI competency as within-subject factors, with point-of-view (self perception/peer perception) as between-subjects. Our results showed that AI with low competency or low proactivity generally improved feelings related to ownership, meaningfulness, satisfaction, and role dynamics, and also increased positive affect while reducing negative affect. However, these effects were often influenced by point-of-view. For instance, low AI proactivity resulted in higher job satisfaction from self-perception rather than peer perception. Based on our findings, we argue that designing AI for the future of work solely around performance metrics may not be adequate. Highly competent and proactive AI-driven systems can have undesirable impacts on perceptions of ownership, job identity, social image and team dynamics, and consequently, job meaningfulness.
[HC-29] UF-AMA: A unified framework for cross-domain emotion recognition via adaptive multimodal alignment
链接: https://arxiv.org/abs/2606.00170
作者: Zheng Wang,Shuo Wang,Junhong Wang
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent years, emotion recognition based on physiological signals such as electroencephalogram (EEG) has gained considerable attention, as internal physiological data offer greater objectivity and reliability compared to external behavioral data like facial expressions. However, due to distribution shifts caused by individual and contextual differences, along with variations in sample quality across modalities, constructing a cross-domain multimodal emotion recognition model with high generalization and robustness remains a key challenge. In this study, we propose a Unified Framework with Adaptive Multimodal Alignment (UF-AMA) to address cross-subject and cross-session emotion recognition using multimodal physiological signals. First, we construct a cross-modal feature fusion network comprising Transformer encoders and multi-head cross-attention modules, enabling the deep integration of EEG signals and eye-tracking data. Subsequently, we introduce a confidence-aware screening mechanism that dynamically assesses the predictive reliability of each modality branch on target domain samples, partitions samples into different quality subsets, and accordingly applies global consistency alignment and cross-modal distillation. Finally, we propose a multi-level domain adaptation framework that jointly optimizes the marginal and conditional distributions of both local modality-specific and global fusion features, thereby reducing cross-domain distribution shifts at multiple granularities. Extensive experiments on the SEED and SEED-IV datasets demonstrate that UF-AMA achieves state-of-the-art (SOTA) performance in both cross-subject and cross-session tasks. The source code is available at: this https URL.
[HC-30] Agreement Metrics for LLM -as-Judge Evaluation: What to Report and Why
链接: https://arxiv.org/abs/2606.00093
作者: Delip Rao,Chris Callison-Burch
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Data Analysis, Statistics and Probability (physics.data-an)
备注: 12 pages
Abstract:Validating an LLM judge against human annotations usually means reporting several agreement statistics: accuracy, precision, recall, F_1 , Cohen’s \kappa , and one or more rank correlations. A survey of 24 recent LLM-as-judge papers finds metric choice entangled with the judgment scale, tie handling, invalid outputs, and abstention handling, and those choices rarely stated. For binary criteria – the common case in rubric-based evaluation, where each criterion is graded MET or UNMET – most of the reported numbers are redundant: Pearson’s r , Spearman’s \rho , Kendall’s \tau_b , the phi coefficient \phi , and the Matthews Correlation Coefficient all reduce to a single number on non-degenerate binary data, so reporting several of them only creates an illusion of corroborating evidence. Cohen’s \kappa is the one agreement coefficient that adds information: it shares \phi ‘s numerator but normalizes differently, and the gap between them measures how far the judge’s positive-label rate has drifted from the human’s. We then trace what changes when a judge may abstain with a CANNOT_ASSESS verdict: the three common ways of handling abstentions are not interchangeable preprocessing choices but answer different questions, and they break the binary equivalences. The same equivalences reappear, up to a negligible finite-sample correction, for multi-judge ensembles scored with Fleiss’ \kappa or Krippendorff’s \alpha . We close with a reporting checklist that names the judgment scale, the abstention and tie handling mode, coverage, the confusion matrix, and the aggregation level alongside any scalar agreement coefficient.
[HC-31] Beyond Categories of Caste: Examining Caste Bias and Morality in Text-to-Image AI Models
链接: https://arxiv.org/abs/2606.00039
作者: Divyanshu Kumar Singh,Dipto Das,Deepika Rama Subramanian,Koustuv Saha,Stephen Voida,Bryan Semaan
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Text-to-Image (T2I) models have shown promising utility across various domains. However, such models are also amplifying harmful societal biases in their outputs. In the context of South Asia, recent work has shown caste biases and stereotypes are being perpetuated through Generative AI (GenAI) systems. While this research offers extremely relevant insight into invisibilized narratives of caste discrimination through the GenAI system, they often treat caste as an identity category. Therefore, in this work we shift our ontology to focus on the relational aspect of caste. This enables us to develop a more nuanced understanding of the mechanics of caste discrimination by and through T2I models. Combining an algorithmic audit with critical discourse analysis, we draw on a conceptual frame challenging Brahminical Normativity to show how caste biases are perpetuated beyond the simple binaries of upper vs lower-caste categories. Our contributions are two-fold. Beyond challenging the categorical understanding of caste as a category, we propose an anti-caste approach to tackle the issue of caste bias and fairness in AI systems.
[HC-32] Update Opacity: Epistemic Accessibility and Governance Under AI System Change
链接: https://arxiv.org/abs/2606.00037
作者: Andrea Ferrario,Joshua Hatherley
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Machine learning models embedded in deployed AI systems are routinely updated to maintain correct functioning over time. Yet such updates can generate update opacity: users may not be able to understand why the same input now yields a different output. We argue that update opacity is best understood as a diachronic failure of epistemic accessibility: the problem is that materially relevant changes may fail to remain accessible to human users in forms that support understanding, calibrated reliance, and appropriate action under real role- and time-specific constraints. This makes update opacity a governance problem. Not all change is equally relevant, and disclosing every update would itself undermine use through overload. To address this problem, we combine two complementary governance approaches: the EU AI Act, which helps specify the system-level perimeter of normatively relevant change, and Machine Learning Operations, which provides operational tools for tracking and comparing change over time. On this basis, we propose a framework that models system change through trustworthiness profiles and trustworthiness levels, and uses threshold-based disclosure to surface materially relevant within-envelope change to different stakeholders over time. We illustrate the approach with a medical AI example and derive practical implications for lifecycle documentation, post-market monitoring, and update disclosure.
[HC-33] Redistributing Voice and Responsibility: AI in Relationship-Centred Care ALT
链接: https://arxiv.org/abs/2606.00028
作者: Kellie Yu Hui Sim,Kenny Tsu Wei Choo
类目: Human-Computer Interaction (cs.HC)
备注: Provocation accepted to the CHI 2026 Workshop on Toward Relationship-Centered Care with AI: Designing for Human Connections in Healthcare. 5 pages
Abstract:Relationship-centred care (RCC) recognises that healthcare quality depends not only on outcomes, but on how voice, responsibility, and emotional labour are negotiated among patients, caregivers, and providers. As AI systems enter sensitive care contexts, they introduce a new participant into these negotiations. Drawing on empirical work in Advance Care Planning (ACP) and peer support, we argue that AI’s primary impact in high-subjectivity domains is not optimisation but redistribution: it reorganises who speaks, who decides, and who bears moral responsibility. Across both settings, participants were less concerned with technical accuracy than with relational consequences: whether AI would appropriately represent their decision, reduce burden, or blur accountability, scaffold connection, or subtly displace it. We identify three relational dimensions: authority, temporality, and visibility, through which AI reshapes care relationships, and propose design provocations centred on relational legibility, bounded agency, responsibility traceability, and non-substitutive scaffolding.
[HC-34] Navigating Independence: A Survey of Visually Impaired Peoples Experiences and Needs
链接: https://arxiv.org/abs/2606.00025
作者: Banafshe Marziyeh Bamdad,Manuel Günther,Alireya Darvishy
类目: Human-Computer Interaction (cs.HC)
备注: 8 pages, 2 figures, 6 Tables. This paper has been accepted for presentation at the ICCHP 2026 conference (International Conference on Computers Helping People with Special Needs)
Abstract:Independent navigation in unfamiliar environments remains a major challenge for blind and visually impaired individuals, despite the availability of assistive technologies. This paper presents the results of a fully accessible online survey investigating navigation experiences, challenges, and technology preferences among people with visual impairments worldwide. The survey was distributed through individuals and organizations supporting visually impaired communities. Our results indicate that smartphone-based applications are the most used digital navigation aids, while a substantial proportion of participants report not using any assistive navigation technology due to cost, accessibility, or usability barriers. Participants reported persistent difficulties in obstacle detection, wayfinding, and navigation in complex environments. Despite a widespread focus on smartphone-based solutions, they expressed a clear preference for wearable and hands-free systems, highlighting a gap between current technology use and user needs. The findings provide a user-centered overview of navigation needs and offer insights into the design and evaluation of future assistive navigation systems.
[HC-35] Understanding Stigmatizing Language in Clinical Documentation: A Paired Comparison of Ambient AI Drafts and Clinician Finalized Notes
链接: https://arxiv.org/abs/2606.00019
作者: Yiliang Zhou,Yawen Guo,Sairam Sutari,Jasmine Dhillon,Alexandra L. Beck,Emilie Chow,Steven Tam,Danielle Perret,Deepti Pandita,Gelareh Sadigh,Archana J. McEligot,Kai Zheng
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Ambient artificial intelligence (AI) documentation tools are increasingly deployed to reduce clinician documentation burden, but their implications for biased language in clinical notes remain unclear. We conducted a large-scale comparison analysis of AI drafts and corresponding clinician finalized notes to quantify stigmatizing language changes pre- and post-editing. Using a lexicon-based natural language processing (NLP) pipeline, we measured (1) the prevalence of stigmatizing language in AI drafts, (2) the prevalence and term composition in final notes, and (3) the frequency of removal or introduction of stigmatizing terms. Across 66,297 paired note sections, 21.4% of AI draft sections contained at least one stigmatizing language mention, rising to 24.0% in clinician finalized versions. Introductions occurred more often than removals, suggesting clinician editing can be a net source of stigmatizing language entering the EHR with using Ambient AI.
[HC-36] Examine Clinicians Modification of Hedging Language in Ambient AI Documentation: A Comparative Study of AI Drafts and Final Notes
链接: https://arxiv.org/abs/2606.00018
作者: Yiliang Zhou,Yawen Guo,Di Hu,Sairam Sutari,Emilie Chow,Steven Tam,Danielle Perret,Deepti Pandita,Kai Zheng
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Ambient AI documentation systems generate clinical note drafts that clinicians frequently revise before signing off into electronic health records, yet how these edits alter hedging language remains unclear. We conducted paired analysis of clinician-edited portions of ambient AI drafts and final notes to examine (1) whether these edits change the prevalence of hedging language, (2) whether these edits exhibit a systematic shift toward greater certainty or uncertainty, and (3) whether these changes in hedging prevalence and directionality differ by ambient AI vendors and clinical specialties. Among 62,811 paired note sections, hedging terms were more often introduced into previously non-hedged text than removed from previously hedged text, and post-edit text contained more hedging mentions than pre-edit text. Directionality analyses showed a significant overall tendency toward greater uncertainty in hedging-related replacement edits. Vendor and specialty analyses revealed substantial heterogeneity in hedging prevalence, pre-to-post changes in hedging mentions, and directionality.
[HC-37] SortingHat: Redefining Operating Systems Education with a Tailored Digital Teaching Assistant
链接: https://arxiv.org/abs/2606.00015
作者: Yifan Zhang,Xinkui Zhao,Zuxin Wang,Zhengyi Zhou,Guanjie Chen,Shuiguang Deng,Jianwei Yin
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Emerging Technologies (cs.ET)
备注:
Abstract:Operating Systems (OS) courses are among the most challenging in computer science education due to the complexity of internal structures and the diversity of running environments. Traditional teaching methods often fail to address the diverse backgrounds, learning speeds, and practical needs of students. To tackle these challenges, we present SortingHat, a personalized digital teaching assistant tailored specifically for OS education. SortingHat integrates advanced AI technologies, including a retrieval augmented generation (RAG) framework and multi agent reinforcement learning (MARL), to deliver adaptive, scalable, and effective educational support. SortingHat features a 3D digital human interface powered by large language models (LLMs) to provide personalized, empathetic, and context aware guidance. It generates tailored exercises based on each student’s learning history and academic performance, reinforcing weak areas and challenging advanced concepts. Additionally, the system incorporates a robust evaluation pipeline that ensures fair, consistent, and unbiased grading of student submissions while delivering personalized, actionable feedback for improvement. By combining personalized guidance, adaptive content creation, and automated assessment, SortingHat transforms OS education into an engaging, immersive, and scalable experience.
[HC-38] A phenomenon of AI-conformity: how algorithms change human moral decision-making
链接: https://arxiv.org/abs/2606.00013
作者: Yana Venerina,Dmitry Koch,Nare Meloyan,Gerda Prutko,Valeriia Lelik,Victoria Taova,Andrey Kurpatov
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 31 pages, 1 figure
Abstract:Social conformity is a well-documented phenomenon in which individuals shift their opinions towards those of a social majority. As artificial intelligence (AI) becomes increasingly integrated into everyday life it may also create a novel source of influence giving rise to algorithmic conformity, mechanisms of which are poorly understood. The present study examined whether AI judgements affect moral decision-making in humans (n=165) adapting the classical Asch paradigm. Participants completed a series of moral dilemmas under three different conditions: in presence of social majority, with an AI model providing brief answers and with an AI model providing both answers and explanations of its choices. In all conditions the presented responses contradicted generally accepted moral norms. The results indicated that an AI model with a reasoning component affected the opinion of participants to a degree comparable to that of a human majority. These findings suggest that even moral judgements, despite their sensitivity and personal significance, may be susceptible to algorithmic conformity. However, the mechanism underlying algorithmic conformity appears to differ from the social one. Overall, the study challenges the assumption that moral decision-making lies in “AI inadmissibility zone” - a sphere that is considered as an area in which only human-made decisions are acceptable and highlights the need for a further investigation of this phenomenon as AI-based recommendations become increasingly embedded into human decision-making.
[HC-39] RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview
链接: https://arxiv.org/abs/2606.00011
作者: Min Hun Lee,Justin Yu Feng Teo
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences of model edits before committing them. We present RuleEdit, an interactive, rule-guided human-AI model editing system that (i) surfaces likely failures through interpretable mismatch signals from rule tables and (ii) supports user-authored rule feedback with prospective previews of projected performance changes and embedding shifts. We instantiate RuleEdit in stroke rehabilitation assessment and evaluate it with health professionals and students. Rule-guided failure detection significantly increased Human + AI performance by 14.16% ( p0.001 ) while improving rejection of incorrect AI and reducing both over- and under- reliance as well as ChangedToWrong decisions. In addition, presenting prospective embedding previews improved participants’ feedback for model adaptation, increasing post-update local performance gains from 11.50% to 36.38% after incorporating users’ rule-based feedback ( p0.001 ). Our findings show that mismatch-based failure cues and prospective impact previews can support failure-aware human-AI model editing, while also revealing a local-global tradeoff: edits that help a specific case can degrade performance when transferred globally. We discuss implications of designing failure-aware and controllable human-AI systems.
[HC-40] Empathic and agent ic artificial intelligence in nursing: perspectives on a human-centered framework for cancer care navigation in the United States
链接: https://arxiv.org/abs/2606.00010
作者: Tyra Girdwood,Saba Kheirinejad,Parnian Kheirkhah Rahimabad,Brianna M. White,Robert L Davis,David L Schwartz,Arash Shaban-Nejad
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 5 Pages, 1 Figure, 1 Table
Abstract:For patients experiencing cancer, nurse navigation can ease the burden of complex care by enhancing coordination of health services and patient outcomes. However, in under-resourced areas, trained nurse navigators may be limited or non-existent. In the United States, artificial intelligence (AI)-enabled digital health tools are increasingly available and may help address gaps in care coordination; however, most are not designed to specifically support nursing. This perspective piece discusses a human-centered AI framework that integrates empathic and agentic approaches grounded in the American Nurses Association’s code of ethics to support nurses in the United States in cancer care navigation. The framework could augment, not replace, human empathy and agency while improving nurse workflow, patient-clinician relationships, and care coordination services in under-resourced areas.
[HC-41] Know Your Author: Does the AI Penalty Hold in Short Fiction?
链接: https://arxiv.org/abs/2606.00006
作者: Michael Todasco,Joselyn Cesare
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 11 pages, 2 figures. Preregistered experiment (N=254). Data, materials, and analysis code available at OSF
Abstract:Public concern about an “AI penalty” suggests that labeling content as AI-generated may negatively influence how it is evaluated. We tested this claim in a preregistered experiment (N = 254, per protocol) using a pure attribution design: participants read one of two ~200-word vignettes and were randomly assigned to see it labeled as Human-written, AI-written, or presented with no author line. Authorship labels did not produce reliable main effects on creativity, enjoyment, recommendation, or originality; observed effect sizes were uniformly small. However, labels strongly influenced inferred effort: participants estimated that Human-labeled stories took far longer to create than AI-labeled stories (back-transformed geometric means from ln[minutes + 1]: 148 vs. 6 minutes). Across conditions, higher inferred effort predicted greater enjoyment, and this relationship was also present within the AI-labeled condition. Additionally, participants’ prior attitudes toward AI moderated recommendation judgments: more positive attitudes were associated with higher recommendation ratings for AI-labeled stories, but not for Human-labeled stories. These findings suggest that while AI authorship labels do not systematically alter average evaluations of short fiction, they meaningfully shape perceptions of effort and interact with prior beliefs to influence downstream judgments.
[HC-42] Navigating Culture in Smart Port Cities: Cultural Sensitivity and Digital Engagement Among Sailing Tourists in the Mediterranean
链接: https://arxiv.org/abs/2606.00004
作者: Panagiota Konstantinou,Georgios Stathakis
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注: 12 pages, 1 figure, 10 tables
Abstract:This study examines the relationship between smart port city infrastructure, tourists or crew cultural sensitivity and digital engagement among international sailing tourists in the Mediterranean and particularly in Greece. It is based on an interdisciplinary literature synthesis and primary data from a survey conducted with a total of 203 respondents over three sailing seasons. This paper proposes a conceptual framework that positions cultural sensitivity as a result of the interaction between smart port destination technology, tourist awareness and their engagement with the local community. Among the findings, high levels of adoption of digital platforms for logistical purposes such as, while culturally oriented digital tools remain underused. A significant discrepancy is found between tourists cultural sensitivity and their practical uncertainty in real cultural situations. Thus highlighting an unmet need of potential visitors for real-time cultural guidance tools. Tourists from distant cultures found to be significantly higher among the entire sample of tourists. The evidence that tourists seek a culturally integrated smart port application is strong, particularly among tourists who experienced the highest levels of uncertainty. The study contributes both conceptual and empirical evidence to the smart cities literature, with practical implications for port planners, tourism policy makers, and digital platform designers.
[HC-43] Learning from Mistakes: Can LLM Self-Recover after Misalignment? AAAI’26
链接: https://arxiv.org/abs/2606.00003
作者: Olga E. Sorokoletova,Francesco Giarrusso,Vincenzo Suriani,Daniele Nardi
类目: Computers and Society (cs.CY); Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC)
备注: AAAI’26 Workshop (WS37), Machine Ethics: from formal methods to emergent machine ethics, January 20–27, 2026, Singapore
Abstract:Responsible AI initiatives place great emphasis on the safety of Large Language Model (LLM)-based systems. In particular, it has become standard practice to subject these models to an alignment procedure aimed at preventing harmful outputs. However, once aligned, a model is not guaranteed to maintain this alignment throughout its lifecycle. Moreover, the likelihood of misalignment increases as malicious actors may deliberately employ jailbreaking techniques to compromise LLM safety. To counter this, much research has focused on improving alignment methods and post-processing filters. In this paper, we introduce a new perspective on advancing LLM alignment: rather than developing stronger alignment techniques, we investigate the model’s intrinsic ability to recover its alignment after corruption. We propose a methodology for modeling the safety trajectories of user-assistant interactions and for detecting recovery trends within them. We apply this approach to a jailbreaking scenario, presenting a preliminary recovery analysis based on a dataset of adversarial multi-turn dialogues and examining the influence of the content moderation model chosen for safety evaluation. Project page with an interactive data visualizer is available at this https URL.
[HC-44] Shu Dao: A Calligraphy Score Framework Linking Calligraphy Music and Performance
链接: https://arxiv.org/abs/2606.00001
作者: Lican Huang
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 47 pages
Abstract:This paper introduces Calligraphy Writing Score Representation (CWSR) and proposes Shu Dao as a framework that interprets East Asian calligraphy as a performative art rather than a static visual artifact. Inspired by traditions such as Japanese Shodō and embodied cultural practices such as Chadao , the framework models calligraphy as a structured performance analogous to musical notation. Instead of representing characters as fixed images, the proposed approach encodes each brush stroke as an ordered and executable action, forming a calligraphy score. Characters are organized within a structured spatial grid, and strokes are annotated with attributes including stroke type, execution order, spatial coordinates, trajectory, compositional role, and dynamic properties such as brush pressure and pacing. This representation captures temporal and expressive aspects of calligraphic writing that are typically absent from image-based representations. The paper makes three main contributions. First, it introduces CWSR as a structured notation system for representing calligraphy across multiple levels, including strokes, character structures, and compositional organization (e.g., layout and zhangfa), together with their rhythmic and performative dynamics. Second, it conceptualizes Shu Dao as a score-mediated framework that models calligraphy as structured performance. Third, it establishes a computational foundation for the analysis, visualization, and executable generation of calligraphic works by AI-based calligraphic agents. Together, these contributions bridge calligraphy, musical notation, and performative cultural practices, supporting human–AI co-creation in computational calligraphy and digital humanities research. Comments: 47 pages Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM) Cite as: arXiv:2606.00001 [cs.HC] (or arXiv:2606.00001v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2606.00001 Focus to learn more arXiv-issued DOI via DataCite Journalreference: Journal of Advances in Information Science and Technology, 2026 4(2), 1-47. https://yvsou.com/journal/index.php/jaist/article/view/43 Submission history From: Lican Huang [view email] [v1] Tue, 24 Mar 2026 20:12:00 UTC (472 KB)
[HC-45] Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding ICLR2026
链接: https://arxiv.org/abs/2606.02305
作者: Matteo Ciferri,Tommaso Boccato,Michal Olak,Matteo Ferrante,Nicola Toschi
类目: Neurons and Cognition (q-bio.NC); Human-Computer Interaction (cs.HC)
备注: Presented at ICLR 2026 Workshop on Representational Alignment (Re-Align)
Abstract:Understanding how speech foundation models relate to human cortical activity is a key challenge for computational neuroscience. Here, we investigate how internal representations from Whisper predict intracranial ECoG responses during naturalistic speech perception. We introduce a time-resolved neural encoder that combines speech embeddings with a recurrent temporal model and soft attention, allowing us to examine layer-wise brain alignment. Intermediate Whisper layers provide the strongest correspondence with neural activity, supporting a hierarchical match between model representations and cortical speech processing. Comparisons with baselines show that high-resolution ECoG responses benefit from temporally structured modelling beyond linear mappings from the same speech representations. In addition, attention maps reveal temporally local alignment between speech embeddings and neural responses, while a phonemic interpretability analysis identifies anatomically coherent phoneme-category organization among encoding-informative electrodes. Together, these results suggest that speech foundation models offer a useful framework for studying time-resolved cortical speech representations.
[HC-46] A 1000-hour EEG-EMG-audio dataset of Japanese speech production
链接: https://arxiv.org/abs/2606.01264
作者: Motoshige Sato,Ilya Horiguchi,Masakazu Inoue,Kenichi Tomeoka,Eri Hatakeyama,Yuya Kita,Atsushi Yamamoto,Ippei Fujisawa,Shuntaro Sasai
类目: Neurons and Cognition (q-bio.NC); Human-Computer Interaction (cs.HC); Sound (cs.SD); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注:
Abstract:We present a multimodal dataset of 1020 hours of simultaneously recorded scalp electroencephalography (EEG), facial electromyography (EMG), and speech audio from three healthy native Japanese speakers during open-vocabulary overt speech. Recordings were acquired with three EEG systems-an ultra-high-density system (this http URL) and two cap-type systems (this http URL and eegosports), spanning 62-128 channels-across many sessions over several months. Each session provides time-synchronized EEG, facial EMG, and audio, together with speech-event annotations and transcriptions. Although collected with speech decoding as a primary motivation, the dataset also supports work on multimodal signal processing, artifact modeling, longitudinal and cross-device adaptation, and EEG representation learning. Technical validation included power spectral density and event-related potential analyses across participants, devices, and tasks, which showed the expected 1/f spectral profile, task-related alpha-band attenuation, and time-locked evoked responses. The dataset is released in Brain Imaging Data Structure (BIDS) format via OpenNeuro under a CC0 waiver to support both speech-related and broader EEG research.
[HC-47] A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces
链接: https://arxiv.org/abs/2606.00106
作者: Javier Jiménez,Francisco B Rodríguez
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by \alpha, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning \alpha yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.
计算机视觉
[CV-0] hinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
链接: https://arxiv.org/abs/2606.02580
作者: Guangzhao He,Rundong Luo,Wei-Chiu Ma,Hadar Averbuch-Elor
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Inverse graphics is a longstanding and highly underconstrained problem that seeks to reconstruct images as editable 3D scenes which can be rendered, relit, and manipulated. In this work, we investigate whether pretrained vision-language models (VLMs) can perform executable inverse graphics directly from a single image by reconstructing a scene as an editable Blender program, without relying on specialized 2D or 3D foundation models, differentiable rendering, or multi-view supervision. We introduce Staged Executable Inverse Graphics (SEIG), an agentic framework that reconstructs a 3D scene from a single image by progressively refining scene factors including geometry, materials, composition, and lighting directly in executable Blender code space. We evaluate our framework across diverse scenes using a range of reconstruction metrics spanning pixel-level, perceptual, and semantic fidelity. Our experiments show that staged reconstruction substantially improves reconstruction fidelity, highlighting the importance of task decomposition for executable inverse graphics with general-purpose VLMs. Finally, we showcase various downstream applications enabled by the reconstructed editable Blender scenes.
[CV-1] Mitigating Perceptual Judgment Bias in Multimodal LLM -as-a-Judge via Perceptual Perturbation and Reward Modeling ICML2026
链接: https://arxiv.org/abs/2606.02578
作者: Seojeong Park,Jiho Choi,Junyong Kang,Seonho Lee,Jaeyo Shin,Hyunjung Shim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:Recent multimodal large language models have demonstrated strong reasoning ability, yet their reliability as automated evaluators remains limited by a critical weakness: when visual evidence conflicts with textual cues, MLLM judges tend to reward plausible narratives over perceptually correct answers. We identify and systematically analyze this phenomenon, which we term Perceptual Judgment Bias. Through controlled visual perturbations, existing multimodal judges frequently anchor on the response text instead of their own visual perception, leading to inconsistent and non-verifiable evaluations. To address this issue, we introduce the Perceptually Perturbed Judgment Dataset, which constructs minimally edited counterfactual responses that isolate perceptual errors and enable verifiable supervision. Building on this dataset, we develop a unified training framework that combines a structured GRPO-based reward with a batch-ranking objective, achieving coherent global ordering without explicit pairwise labels. Experiments across diverse MLLM-as-a-Judge benchmarks show that our approach substantially improves perceptual fidelity, ranking coherence, and alignment with human evaluation. Our results establish a scalable and generalizable pathway for training multimodal judges that are perceptually grounded, interpretable, and robust to visual-reasoning conflicts.
[CV-2] RoboDream: Compositional World Models for Scalable Robot Data Synthesis
链接: https://arxiv.org/abs/2606.02577
作者: Junjie Ye,Rong Xue,Basile Van Hoorick,Runhao Li,Harshitha Rajaprakash,Pavel Tokmakov,Muhammad Zubair Irshad,Vitor Guizilini,Yue Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Scaling robot learning requires large-scale, diverse demonstrations, yet real-world data collection via teleoperation remains prohibitively expensive and time-consuming. While video diffusion models offer a promising avenue for data scaling, existing generative approaches are often limited to superficial visual augmentation, or suffer from embodiment hallucinations that yield physically infeasible motions. We present a generalizable embodiment-centric world model that achieves scalable data generation by synthesizing photorealistic demonstrations with novel objects, in novel scenes, and from novel viewpoints. Our approach anchors generation to rendered robot motion while conditioning on explicit scene and object priors, effectively decoupling trajectory execution from environment synthesis. This formulation has the potential to unlock two powerful data scaling capabilities: (1) retrieval and rebirth, which repurposes existing trajectories into entirely new contexts without new motion data; and (2) prop-free teleoperation, where operators manipulate empty air and the model hallucinates the target objects and scene afterwards, eliminating reset time. We demonstrate with real-world experiments that our generated data consistently improves downstream policy performance and significantly reduces real-world data requirements across diverse manipulation tasks.
[CV-3] ProtoAda: Prototype-Guided Adaptive Adapter Expansion and Geometric Consolidation for Multimodal Continual Instruction Tuning
链接: https://arxiv.org/abs/2606.02576
作者: Yu-Cheng Shi,Zhen-Hao Xie,Jun-Tao Tang,Da-Wei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, but real-world deployment requires them to continually acquire new vision-language capabilities, making Multimodal Continual Instruction Tuning (MCIT) essential. To reduce inter-task interference and promote collaboration, recent methods often employ sparse architectures like Mixture of LoRA Experts with image-text similarity routing. However, tasks with distinct response structures could share highly similar visual-linguistic semantics and thus be wrongly routed to the same expert; image-text similarity alone is insufficient for reliable task assignment. For example, an expert in a grounding task requiring coordinate prediction may be biased toward producing short textual answers after learning semantically similar VQA tasks. This format-blind task assignment integrates heterogeneous response types into shared parameters, inducing gradient interference and ineffective expert collaboration. To address this problem, we propose ProtoAda, a prototype-guided adaptive tuning framework. ProtoAda introduces format-aware task prototypes to align task assignment and routing with both task semantics and output structure, and further consolidates format-compatible updates in a geometry-aware manner to effectively reuse and progressively refine existing parameters. Extensive experiments on multiple benchmarks demonstrate that ProtoAda achieves superior performance, especially on tasks whose answer structures are easily corrupted by sequential tuning.
[CV-4] From Zero to Hero: Training-Free Custom Concept Spawning in World Models
链接: https://arxiv.org/abs/2606.02575
作者: Kiymet Akdemir,Pinar Yanardag
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive world models have emerged as a powerful paradigm for interactive video generation, allowing users to navigate dynamically generated environments through actions. These models are typically conditioned on a text prompt and/or a single reference frame, from which the entire world is generated. Yet the moment the user navigates beyond what is visible in that frame, the unseen regions are populated by the base model’s priors, with no mechanism for the user to specify what should appear and where. This is a fundamental limitation for applications such as gaming, interactive storytelling, and simulation, where controllable scene composition is essential. We refer to this missing capability as concept spawning; introducing a user-specified visual concept into a world model, analogous to spawning in a game engine. We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. SPAWN exploits a structural property of image-to-video backbones: the first slot of the context memory is pinned to the reference frame and acts as a foundational anchor for every generated chunk. By swapping this anchor with an external concept latent over a short injection window and letting the original anchor return, we cause the concept to propagate naturally through the rollout via the model’s own memory. SPAWN supports concepts from fine-grained entities such as characters and props to large-scale elements such as buildings and landmarks, and accepts either a concept image or a text description as input. Experiments show that SPAWN integrates concepts with consistent lighting, scale, and perspective while preserving identity and temporal coherence, demonstrating that controllable concept spawning is achievable in existing autoregressive world models without any training.
[CV-5] HumanNOVA: Photorealistic Universal and Rapid 3D Human Avatar Modeling from a Single Image CVPR2026
链接: https://arxiv.org/abs/2606.02573
作者: Hezhen Hu,Wangbo Zhao,Lanqing Guo,Hanwen Jiang,Jonathan C. Liu,Zhiwen Fan,Kai Wang,Zhangyang Wang,Georgios Pavlakos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Highlight
Abstract:In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page at this https URL .
[CV-6] VISReg: Variance-Invariance-Sketching Regularization for JEPA training
链接: https://arxiv.org/abs/2606.02572
作者: Haiyu Wu,Randall Balestriero,Morgan Levine
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Self-supervised learning methods prevent embedding collapse via modeling heuristics or explicit regularization of the embedding space. Among the latter, VICReg decomposes regularization into variance and covariance objectives, offering flexibility and interpretability. However, covariance captures only second-order statistics – encouraging decorrelation but failing to enforce the full distributional shape needed for stable training. Sketching-based methods such as SIGReg address this by aligning embeddings to an isotropic Gaussian, but lack flexibility and suffer from vanishing gradients under collapse. We propose Variance-Invariance-Sketching Regularization (VISReg), which replaces covariance with a Sliced-Wasserstein-based sketching objective that enforces full distributional shape, while retaining a variance term for scale control. By decoupling scale and shape, VISReg combines VICReg’s flexibility with the distributional rigor of sketching methods, providing robust gradients even under collapse. We show that VISReg scales linearly, outperforms existing regularization on low-quality datasets, and is resilient to long-tailed and low-rank regimes. Pre-trained on ImageNet-1K, VISReg achieves state-of-the-art performance on out-of-distribution datasets. Pre-trained on ImageNet-22K, it matches DINOv2’s OOD performance despite the latter using 10x more data (LVD-142M). Project and code: this https URL.
[CV-7] Policy-based Foveated Imaging and Perception
链接: https://arxiv.org/abs/2606.02565
作者: Howard Xiao,Jan Ackermann,Boyang Deng,Gordon Wetzstein
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website at this https URL
Abstract:Ultra-high-resolution image sensors offer the potential to capture fine spatial details critical for many visual perception tasks, but acquiring and processing all pixels at full resolution is often infeasible under realistic bandwidth, latency, and power constraints. Existing approaches address this challenge through acquisition strategies such as spatial or temporal downsampling, which irrevocably discard information before task relevance can be assessed. In this work, we introduce a real-time, predictive, and task-aware foveated imaging system that operates directly at image acquisition time. Leveraging emerging dual-stream sensor architectures, our method dynamically allocates limited pixel bandwidth to task-relevant regions of interest while maintaining a low-resolution global context. We formulate foveated acquisition as a sensor attention policy-learning problem, in which past observations guide actions that determine future measurements, closing the perception-acquisition loop. Through extensive simulation across multiple perception tasks, we demonstrate that our approach achieves high task performance under strict pixel budgets and significantly outperforms relevant baselines operating at the same bandwidth. We further validate our system on a 200-megapixel dual-stream sensor, capturing real-world videos under realistic bandwidth and latency constraints, demonstrating the practical feasibility of task-driven, acquisition-time foveated imaging.
[CV-8] VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
链接: https://arxiv.org/abs/2606.02564
作者: Junhao Cheng,Liang Hou,Tianxiong Zhong,Xin Tao,Pengfei Wan,Kun Gai,Jing Liao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:The recent “Reasoning with Video” paradigm utilizes Video Generation Models (VGMs) to generate temporally coherent visual trajectories to complete reasoning tasks. Although state-of-the-art VGMs excel at visual quality, they often struggle to understand and follow task-specific rules, leading to logical failures across diverse reasoning scenarios. Existing efforts try to utilize Vision-Language Models (VLMs) as problem pre-solvers to produce or refine textual guidance for the VGM. However, textual descriptions fail to capture intricate spatiotemporal details, and VGMs often struggle to faithfully execute fine-grained or long-tail instructions even with a valid plan. While VLMs struggle as solvers, they possess strong perception capabilities to evaluate process-constraint satisfaction and final-goal achievement. Leveraging this strength, we introduce a paradigm shift that transitions the role of VLMs to “teachers”. Specifically, a VLM teacher extracts task-specific rules to formulate differentiable rewards, guiding a VGM Reasoner via test-time online optimization of a lightweight LoRA module. This strategy enables adaptive test-time optimization and extends the reasoning capabilities beyond the VGM’s intrinsic boundaries. Evaluations on symbolic (VBVR-Bench) and general-purpose (RULER-Bench) video reasoning benchmarks show that the proposed method yields a 16.7-point average performance gain, outperforming the VLM-as-Solver paradigm (+0.4 points) and Best-of-N scaling (+2.2 points) by a large margin at comparable test-time cost. These findings reveal that integrating VLMs as test-time teachers offers a promising paradigm for achieving generalizable video reasoning. Project Page: this https URL
[CV-9] LongLive-RAG : A General Retrieval-Augmented Framework for Long Video Generation
链接: https://arxiv.org/abs/2606.02553
作者: Qixin Hu,Shuai Yang,Wei Huang,Song Han,Yukang Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 7 figures, 4 tables
Abstract:Autoregressive (AR) video diffusion enables variable-length synthesis, but long-horizon generation often suffers from accumulated errors and identity drift. For efficiency, existing methods commonly adopt sliding-window attention during generation. This creates an irreversible generation trajectory: once the active window accumulates appearance errors, subsequent generations can only condition on this degraded trajectory and drift further away. We address this limitation by formulating long video generation as a retrieval-augmented generation (RAG) problem. Rather than relying solely on the recent window, we treat previously generated latents as a dynamic, searchable history. We propose LongLive-RAG, a general retrieval framework for AR video generation. At each new block, LongLive-RAG uses a query embedding to retrieve relevant historical latents. This lightweight retrieval step adds only a small overhead relative to generation and lets the generator condition on non-local context instead of only the recent window. To make retrieval more discriminative, we introduce the Window Temporal Delta Loss that suppresses redundant local similarity and encourages embeddings to capture meaningful temporal changes. Together, these components help reduce error accumulation caused by sliding-window attention. Experiments across multiple AR backbones and generation lengths show improved long-video quality and the best average VBench-Long rank. To our knowledge, among open-ended AR long video generation methods, LongLive-RAG is the first to formulate self-generated latent history as content-addressable retrieval memory. Code is available at this https URL.
[CV-10] Modeling Depth Ambiguity: A Mixture-Density Representation for Flying-Point-Free Depth Estimation
链接: https://arxiv.org/abs/2606.02552
作者: Siyuan Bian,Congrong Xu,Jun Gao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite advances in depth estimation, flying points remain a persistent failure mode: near object boundaries, depth estimators often predict spurious 3D points in the empty space between foreground and background surfaces. We trace this artifact to a standard modeling choice: assigning each pixel a single depth hypothesis. At boundaries, a pixel can straddle a foreground and a background surface, so its true depth is ambiguous between the two. A model that predicts a single depth cannot keep both possibilities, so training instead pulls the prediction toward an intermediate depth that lies on neither surface. We address this with MDA, a mixture-density representation that lets the model predict multiple depth hypotheses and their associated probabilities for each pixel. Near boundaries, different hypotheses can align with different surfaces, and the decoded depth is selected from one of these hypotheses rather than placed in the empty space between them. Across different backbones, MDA substantially improves boundary reconstruction and largely removes flying-point artifacts even under severe input blur, while adding negligible runtime overhead. The same mixture-density framework naturally extends to transparent objects, where it predicts multiple depth layers at transparent pixels, and to sky regions, where a dedicated component separates the unbounded sky from finite-depth regions, producing flying-point-free skylines. Project Page: this https URL.
[CV-11] AFUN: Towards an Affordance Foundation Model for Functionality Understanding
链接: https://arxiv.org/abs/2606.02551
作者: Zhaoning Wang,Yi Zhong,Jiawei Fu,Henrik I. Christensen,Jun Gao
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet, building an affordance foundation model that not only understands where and how the interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge, either localizing task-relevant regions without specifying executable motion, or predicting motion but with limited scalability. In this paper, we present ourmodel, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, ourmodel predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate ourmodel from three aspects: for affordance segmentation, ourmodel outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7–61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. ourmodel can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: this https URL
[CV-12] LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
链接: https://arxiv.org/abs/2606.02535
作者: Lu Liu,Huiyu Duan,Chenxin Zhu,Jintong Lu,Haoyun Jiang,Liu Yang,Qiang Hu,Guangtao Zhai,Xiaoyun Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce \textbfLL-Bench, a comprehensive \textbfBenchmark for evaluating the capabilities of large-scale generative models on \textbfLow-\textbfLevel vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, which are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with conventional representative restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench, which exhibit significant discrepancy with human ratings. To better align restored-image quality assessment with human preferences, we further propose \textbfLL-Score, an MLLM-based evaluator that captures both restoration quality and hallucination existence. Extensive experiments demonstrate that LL-score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.
[CV-13] Improving Combined Detection and Classification of TEM Defects via Mask-Conditioned Latent Diffusion Augmentation
链接: https://arxiv.org/abs/2606.02532
作者: Ni Li,Nuohao Liu,Ryan Jacobs,Ajay Annamareddy,Maciej P. Polak,Kevin Field,Izabela Szlufarska,Dane Morgan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Analyzing microstructural defects in transmission electron microscopy (TEM) images, particularly in irradiated metal alloys, is often limited by the availability of high-quality, labeled data. To address this, we introduce a generative data augmentation approach using a mask-conditioned latent diffusion model (LDM) for synthesizing realistic TEM images with controllable, automatically labeled multi-class defect masks. Without requiring manual annotations for generation, our method enables the creation of synthetic image-mask pairs by sampling distributions learned from experimental masks. These generated data were used to augment small experimental datasets of varying sizes (10, 50, and 100 labeled experimental images) to train a Mask Regional Convolutional Neural Network (R-CNN) model for defect detection and classification. Our results show that generative augmentation yields small overall model performance improvements, with up to a 0.02 gain in the harmonic mean of detection and classification F1 scores. However, we also find that the relative contributions to detection and classification improvement depend on the specific train/test data split. These findings highlight the potential of targeted generative models to enhance deep learning performance in data-scarce microscopy-based image quantification tasks.
[CV-14] Why Not Hyperparameter-Friendly Optimisation? A Monotonic Adaptive Norm Rescaling Approach For Long-Tailed Recognition
链接: https://arxiv.org/abs/2606.02526
作者: Shuo Zhang,Chenqi Li,Tingting Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-tailed recognition poses a significant challenge for deep learning. The two-stage decoupling paradigm, which separates representation learning from classifier retraining, offers a promising solution. During the classifier retraining stage, adaptive norm rescaling is a popular technique. It adjusts the per-class weight norms via parameter regularization, which inevitably introduces hyperparameters. However, many studies report that long-tailed recognition is sensitive to these hyperparameters, as their setup significantly impacts performance. In this paper, we first provide a class-conditional distribution perspective to support norm rescaling methods. Furthermore, we propose a simple but effective approach called Self-Adaptive Monotonic Normalization (SAMN). SAMN avoids the need for parameter regularization. It directly enforces monotonicity on per-class weight norms using the Pool Adjacent Violators Algorithm, making the method hyperparameter-friendly. SAMN is a universal strategy that integrates seamlessly with other methods for enhanced performance. Experiments on benchmark datasets demonstrate that our method significantly boosts long-tailed recognition performance, often achieving state-of-the-art results.
[CV-15] Moment-Video: Diagnosing Temporal Fidelity of Video MLLM s on Momentary Visual Events
链接: https://arxiv.org/abs/2606.02522
作者: Xiaolin Liu,Yilun Zhu,Xiangyu Zhao,Xuehui Wang,Yan Li,Xin Li,Haoyu Cao,Xing Sun,Shaofeng Zhang,Xu Yang,Zhihang Zhong,Xue Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 pages, 10 figures, 11 tables
Abstract:Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
[CV-16] Drifting Preference Optimization for One-Step Generative Models
链接: https://arxiv.org/abs/2606.02521
作者: Zhou Jiang,Yandong Wen,Zhen Liu
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 9 figures
Abstract:One-step text-to-image generators are attractive for deployment because they generate an image with a single forward pass, but preference finetuning them remains difficult: standard alignment methods often rely on policy likelihoods, denoising trajectories, differentiable reward gradients, or test-time optimization. We propose Drifting Preference Optimization (DrPO), an online preference-finetuning method for deterministic one-step generators. For each prompt, DrPO samples candidates from the current generator, ranks them with a target reward, and uses high- and low-scoring samples to synthesize a feature-space update direction. The update is a non-parametric dipole preference field plus a reference drift estimated from the frozen base generator, and is optimized through a detached feature-space regression target. The target reward is used only for ranking, so DrPO can train with large, black-box, or non-differentiable rewards while inference remains a single generator call. We evaluate DrPO on SD-Turbo and SDXL-Turbo with multiple target rewards and benchmarks, including HPSv3 and GenEval. DrPO improves alignment over reward-gradient-free one-step preference baselines and reduces HPSv3 training computation by 3.51\times under the matched effective-batch setting by removing reward-model backpropagation. Initial offline experiments suggest that sample-based gradient synthesis can also be used beyond online reward ranking.
[CV-17] oolFG: Towards Well-Grounded Fine-Grained Image Classification
链接: https://arxiv.org/abs/2606.02518
作者: Yu Xue,Haoxuan Qu,Zhuoling Li,Yihang Lou,Yan Bai,Hossein Rahmani,Jun Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fine-grained image classification (FGIC) has broad applications and has attracted significant research attention. In this paper, we explore a novel paradigm for solving FGIC by proposing \textbfToolFG, the first tool-integrated MLLM-based framework tailored to FGIC. ToolFG enables MLLMs to autonomously and flexibly use external tools during the reasoning process, actively interact with images, and collect verifiable visual cues for distinguishing highly similar categories in a more \textitreliable and \textitwell-grounded manner. To equip the model with such tool-use ability, we design a novel \textbfMCTS-guided tool-use knowledge distillation mechanism, which effectively mines tool-use- and FGIC-relevant knowledge from advanced proprietary MLLMs for model training. Furthermore, we propose a \textbfmodel-tool co-evolution mechanism that jointly refines the toolset and the model’s tool-use policy, driving them toward a mutually adapted and FGIC-specialized state. Extensive experiments demonstrate the effectiveness of our framework.
[CV-18] Not All Points Are Equal: Uncertainty-Aware 4D LiDAR Scene Synthesis CVPR2026
链接: https://arxiv.org/abs/2606.02510
作者: Xiang Xu,Alan Liang,Youquan Liu,Xian Sun,Linfeng Li,Lingdong Kong,Ziwei Liu,Qingshan Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: CVPR 2026 E2E3D Workshop; GitHub at this https URL
Abstract:Constructing faithful 4D worlds from LiDAR-acquired sequences is crucial for embodied AI, yet current generative frameworks apply uniform modeling capacity across all spatial regions. This ignores that perceptual difficulty varies dramatically within a single scan: distant surfaces, occluded boundaries, and small-scale objects carry far higher uncertainty than well-observed structures. We present U4D, a new framework that explicitly leverages spatial uncertainty to guide LiDAR scene generation in a “hard-to-easy” schedule. U4D derives per-point uncertainty maps via Shannon Entropy from a pretrained segmentor, then applies an unconditional diffusion stage to synthesize high-entropy areas with precise geometry, followed by a conditional completion stage that fills in the remaining regions using these structures as priors. A MoST (Mixture of Spatio-Temporal) block further maintains cross-frame coherence by dynamically balancing spatial detail and temporal continuity. Extensive experiments on nuScenes and SemanticKITTI demonstrate state-of-the-art scene fidelity, temporal consistency, and downstream performance.
[CV-19] Question-Aware Evidence Ledgers for Video Relational Reasoning CVPR2026
链接: https://arxiv.org/abs/2606.02506
作者: Yilin Ou,Mengshi Qi,Huadong Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical report for the VRR Challenge at the VideoLLMs Workshop, CVPR 2026
Abstract:The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
[CV-20] GloResNet: A lightweight 3D CNN with global topological features for preterm brain injury prediction
链接: https://arxiv.org/abs/2606.02498
作者: Boyu Yuan,Jiamiao Lu,Weichuan Zhang,Benqing Wu,Tuo Wang,Changshan Wang,Changming Sun,Liang Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study introduces an automated deep learning framework for predicting brain injury (BI) in preterm infants from T2-weighted MRI (dHCP dataset). We propose GloResNet, a lightweight 3D CNN based on ResNet-10, pretrained on MedicalNet to address data scarcity. A global manifold mapping strategy first resamples each 3D volume to 128x128x128 and then applies subject-wise z-score intensity normalization, thereby preserving global topology while standardizing appearance. Training integrates mixup, class weighting, and test-time augmentation for robustness. In 5-fold cross-validation, GloResNet achieved 75.18% average accuracy (peak 81.82%), with specificity 0.81 and sensitivity 0.76. Results demonstrate that a topology-aware lightweight CNN has the capability to effectively predict neonatal BI, offering a non-invasive screening tool. The source code of this paper can be obtained from the GitHub repository: this https URL
[CV-21] MORPHOS: Autoregressive 4D Generation with Temporal Structured Latents
链接: https://arxiv.org/abs/2606.02491
作者: Minkyung Kwon,Jinhyeok Choi,Youngjin Shin,Jaeyeong Kim,JongMin Lee,Seungryong Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:We present MORPHOS, a novel autoregressive framework that generates dynamic 3D assets from videos across diverse representations, including meshes, 3D Gaussians, and radiance fields. Existing methods are typically limited to a single representation, struggle to model topological changes, or fail to maintain temporal consistency over long videos. To address these limitations, we introduce the Temporal Structured Latents (T-SLAT), a unified 4D representation that jointly encodes geometry and appearance along the temporal dimension. Leveraging T-SLAT, MORPHOS autoregressively generates dynamic 3D assets via causal attention, conditioning each frame on its preceding history to ensure temporal consistency while handling evolving topologies. We also propose a temporal-structural augmentation to mitigate error accumulation in autoregressive generation. MORPHOS achieves state-of-the-art performance in appearance and competitive results in geometry across multiple benchmarks, demonstrating superior generalization across various representations and robustness in long-horizon generation.
[CV-22] X-Stream: Exploring MLLM s as Multiplexers for Multi-Stream Understanding
链接: https://arxiv.org/abs/2606.02482
作者: Peiwen Sun,Xudong Lu,Huadai Liu,Yang Bo,Dongming Wu,Huankang Guan,Minghong Cai,Jinpeng Chen,Xintong Guo,Shuhan Li,Rui Liu,Xiangyu Yue
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:While video streaming understanding has made significant strides, real-world applications, such as live sports broadcasting, autonomous driving, and multi-screen collaboration, inherently demand continuous, multi-stream interactions. However, existing benchmarks are confined to single-stream paradigms, leaving a critical gap in evaluating online, cross-stream reasoning. To bridge this, we introduce X-Stream, the first benchmark dedicated to multi-stream streaming understanding. Comprising 4,220 rigorously curated QA pairs across 932 videos, X-Stream evaluates 11 subtasks across multi-window, multi-view, and multi-device scenarios. Crucially, our dataset is constructed using a novel dual-verification pipeline that prevents over-reliance on a single stream. Furthermore, we pioneer the conceptualization of multi-modal large language models (MLLMs) as naive multiplexers, systematically evaluating their performance through the lens of Signal Multiplexing Theory. Our extensive online inference experiments reveal a stark reality: state-of-the-art MLLMs struggle significantly with concurrent streams, achieving only about 50% score and exhibiting poor proactive ability. Ultimately, X-Stream exposes the trade-off of current multiplexing schemes, providing both a practical evaluation protocol and empirical guidance for next-generation multi-stream agents.
[CV-23] Places in the Wild: A Large High-Resolution RAW Photograph Dataset for Ecologically Valid Vision Research
链接: https://arxiv.org/abs/2606.02481
作者: Michelle R. Greene
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 3 tables, 4 figures
Abstract:Large image datasets have accelerated progress in cognitive neuroscience and computer vision. However, most datasets are low-resolution, internet-sourced JPEGs with unknown capture conditions and limited spatial context. Places in the Wild is a dataset of 67,574 high-resolution photographs collected in situ across 810 physical locations spanning 260 basic-level scene categories, including indoor, urban, and natural environments. At each location, a 45-megapixel Canon EOS R5 mounted on a panoramic tripod captured 72 images at 5-degree horizontal intervals plus 12 images at varying elevations, yielding dense 360-degree viewpoint sampling. All images were recorded simultaneously as 14-bit RAW (CR3) files and compressed JPEGs, preserving sensor-level detail for analyses of luminance, contrast, color, and other image statistics. The dataset is accompanied by complete EXIF metadata and a suite of image-quality metrics. Places in the Wild supports research on viewpoint-dependent recognition in humans and models, training and evaluation of scene-understanding systems under realistic conditions, characterization of natural scene statistics, and experiments requiring near-full-field visual displays.
[CV-24] Retrieve Whats Missing: Coverag e-Maximizing Retrieval for Consistent Long Video Generation
链接: https://arxiv.org/abs/2606.02479
作者: Minseok Joo,Dogyun Park,Taehoon Lee,Kyujin Lee,Hyunwoo J. Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 10 figures, 5 tables
Abstract:Maintaining long-term geometric consistency remains challenging for long-horizon autoregressive video generation. Memory-augmented generative models address this by retrieving historical frames, but their effectiveness depends on two key design choices: what 3D-geometric evidence should represent past observations, and how memory frames should be selected from this evidence. Existing methods often rely on camera poses or field-of-view overlap, which are lightweight but too coarse to reason about pixel-wise visibility, or use explicit 3D reconstruction, which provides fine-grained evidence but is costly to maintain over long rollouts. We propose Coverage-Maximizing Retrieval-Augmented Generation (COVRAG), a depth-based memory retrieval framework that uses pretrained 3D priors to construct a target-view coverage map as lightweight 3D memory evidence. For frame selection, COVRAG maximizes residual coverage gain, iteratively retrieving frames that explain target-view regions not covered by the current context or previously selected memories. To improve scalability in long-video generation, we introduce sliding-window depth caching for efficient geometry estimation. Experiments on RealEstate10K and DL3DV10K show that COVRAG improves long-horizon geometric consistency while maintaining low latency compared to baselines.
[CV-25] MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence CVPR2026
链接: https://arxiv.org/abs/2606.02463
作者: Hilton Raj,Vishnuram AV
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to CVPR 2026 Foundation Models Meet Embodied Agents Workshop
Abstract:In 3D environments, Embodied Agents answer spatially relevant questions through reasoning from a mixture of modalities including natural language, RGB images, point clouds, depth maps and camera poses. Existing Vision-Language models (VLMs) are fine-tuned over a single modality. This completely ignores the question semantics which may favor a different modality than the finetuned modality. To address this, we propose MASER (Modality-Adaptive SpEcialist Routing), a lightweight framework that trains five different modality adapters of a shared VLM backbone and learns a neural routing policy that selects the best adapter based on the question during inference. We encode each question with a frozen sentence transformer and pass the embedding through a small Multi-layer Perceptron (MLP) trained on oracle adapter-accuracy labels. We evaluate our methodology over the Open3D-VQA benchmark and our evaluations show that no single modality is universally optimal – point-cloud answers are best in 51.5% of cases. MASER routes with 51.3% oracle agreement, outperforming a Random-Forest ablation (43.5%), with only a single adapter call per question.
[CV-26] Active Exploring like a Pigeon: Reinforcing Spatial Reasoning via Agent ic Vision-Language Models ICML2026
链接: https://arxiv.org/abs/2606.02459
作者: Wei Deng,Xianlin Zhang,Mengshi Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICML 2026
Abstract:Enabling Vision-Language Models (VLMs) to perform spatial reasoning remains challenging. Existing approaches treat VLMs as passive observers, which is difficult for real-world applications. Moreover, reinforcement learning methods rely on sparse rewards, limiting their effectiveness for complex reasoning tasks. Inspired by pigeons’ building and exploiting cognitive maps for navigation, we propose a novel agentic pipeline for spatial reasoning. First, we introduce a new \emphdynamic cognitive map parameterizing scene layout as object positions and orientations, serving as persistent memory for new observations. Second, we propose a novel \emphSpatial Assertion Codes (SAC), Python expressions programmatically describing spatial relationships. By collaborating with the dynamic cognitive map, SAC enables verification of intermediate reasoning steps, providing dense reward signals. We optimize the model via supervised and reinforcement finetuning. Experiments on the MindCube benchmark demonstrate state-of-the-art performance with \emph80.5% overall accuracy, outperforming the best current method by \emph29.5 accuracy points (a relative improvement of \emph53.2%) on the challenging \textscRotation subset. Our code and data are open-sourced at this https URL.
[CV-27] Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior ICML2026
链接: https://arxiv.org/abs/2606.02453
作者: Xiang Li,Dianbo Liu,Kenji Kawaguchi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026 Spotlight
Abstract:Despite the remarkable fidelity of generative models, they frequently suffer from mode collapse. Existing strategies for enhancing diversity predominantly focus on intervening during the generation trajectory. We identify a critical oversight that the standard Gaussian initialization often causes trajectories to collapse into dominant modes because it is agnostic to the guidance potential landscape. In this work, we formulate selecting the initial noise from a guidance potential posterior, which effectively re-weights the prior towards diversity-rich regions. To sample from this distribution efficiently, we introduce Diversity-inducing Initialization (DivIn), which leverages Langevin dynamics to actively navigate the initialization landscape, steering initial noise away from collapsing regions while anchoring them to the valid data manifold. Our method serves as an inference-time diversity enhancement compatible with both diffusion and flow matching models. Extensive experiments show that DivIn exhibits a superior performance in both class-to-image and text-to-image scenarios. Furthermore, we highlight that as DivIn is orthogonal to trajectory-based methods, combining them significantly expands the diversity-quality Pareto frontier beyond what either achieves in isolation.
[CV-28] Reason -Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion
链接: https://arxiv.org/abs/2606.02450
作者: DongQing Liu,MengShi Qi,HongWei Ji
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:CoVR-R studies reason-aware composed video retrieval: given a reference video and an edit instruction, the system must retrieve the target video that satisfies the edit. The main difficulty is that the target is not described directly; it must be inferred from fine-grained changes in object identity, action order, final state, hand interaction, and scene transition. We build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B. For each gallery video, the model generates a retrieval-oriented structured description and a dense embedding by pooling generated-token hidden states with token-dependent weights. For each query, the model first performs edit reasoning over the reference video and instruction, then generates a target-video description whose hidden states serve as the query embedding. We complement dense retrieval with a TF-IDF branch over the generated texts and fuse the two rankings with split-specific weights. On validation, the current best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73 at R@1, 95.79 at R@5, 96.63 at R@10, and 97.98 at R@50.
[CV-29] Spatial-Temporal Decoupled Reference Conditioning for Identity-Preserving Text-to-Video Generation
链接: https://arxiv.org/abs/2606.02441
作者: Yuheng Chen,Teng Hu,Yuji Wang,Qingdong He,Lizhuang Ma,Jiangning Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Identity-preserving video generation (IPVG) aims to synthesize high-fidelity videos that follow text prompts while faithfully preserving a reference identity. Despite recent progress, existing IPVG methods still struggle to balance high-level semantic control and low-level identity fidelity. To bridge this gap, we propose ST-DRC, an effective Spatial-Temporal Decoupled Reference Conditioning framework for identity-preserving text-to-video generation. At the framework level, ST-DRC performs latent in-context feature injection by encoding the reference image with the video VAE and concatenating it with noisy video latents, enabling rich low-level identity details to be accessed without additional adapters. To separate identity-aware reference retrieval from appearance copying, we introduce TASS-RoPE, a Temporal-Adjacent Spatial-Shifted RoPE scheme that places reference tokens near the video sequence in time but shifts them in space, allowing reference information to flow through spatio-temporal attention while suppressing pixel-level copy-paste shortcuts. To further prevent shortcut learning and strengthen the otherwise diluted identity supervision in the diffusion objective, we combine appearance-invariant reference augmentation with face-guided identity objectives, encouraging the model to preserve identity under variations in color, pose, and layout. At inference time, we introduce a three-stream reference classifier-free guidance strategy that independently controls text adherence and reference fidelity. Experiments demonstrate that ST-DRC achieves strong identity preservation, prompt alignment, temporal consistency, and video quality with a lightweight design built on LTX-2.3. Our method ranks among the top submissions in the facial identity-preserving video generation track, validating the effectiveness of spatial-temporal decoupled reference conditioning.
[CV-30] Geometry-Aware Implicit Memory for Video World Models
链接: https://arxiv.org/abs/2606.02436
作者: Zhengxuan Wei,Xu Guo,Xinghui Li,Xunzhi Xiang,Min Wei,Yiran Zhu,Qiulin Wang,Xintao Wang,Pengfei Wan,Xiangwang Hou,Qi Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Video world models aim to simulate controllable visual environments, but long-horizon rollouts depend on what the model remembers after observations leave its native context window. Explicit memories retain frames or online 3D reconstructions, which can suffer from heuristic retrieval errors, redundant appearance storage, or reconstruction artifacts. Implicit memories compress history into a compact state, but existing designs are not explicitly constrained to encode cross-view scene geometry. We propose GIM-World, a geometry-aware implicit memory framework for video world models. A lightweight transformer encoder compresses variable-length history into fixed-size memory tokens, a camera-queryable geometry head distills 3D scene structure from a frozen foundation model into the memory during training, and an information-guided pruning rule keeps encoding cost bounded as history grows. The geometry teacher is discarded at inference, leaving a lightweight memory module. Experiments on MIND show that GIM-World better preserves long-horizon geometric and visual consistency than both explicit- and implicit-memory baselines.
[CV-31] GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics
链接: https://arxiv.org/abs/2606.02424
作者: Kaito Shiku,Ahtisham Fazeel Abbasi,Ryoma Bise,Yuichiro Iwashita,Kazuya Nishimura,Andreas Dengel,Muhammad Nabeel Asim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.
[CV-32] Edge Prediction for Roof Wireframe Reconstruction with Transformers CVPR2026
链接: https://arxiv.org/abs/2606.02406
作者: Gustav Hanning,Ludvig Dillén,Jonathan Astermark,Johanna Lidholm,Viktor Larsson
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the 3rd Urban Scene Modeling (USM3D) Workshop at CVPR 2026
Abstract:This paper presents a competitive solution to the S23DR Challenge 2026, which aims to reconstruct 3D house roof wireframe models from sparse SfM point clouds and ground-level semantic segmentations and depth maps. Our proposed method utilizes an end-to-end Transformer encoder-decoder architecture inspired by DETR. To effectively process the geometric and semantic data, the sparse SfM point cloud input is dynamically subsampled based on semantic priority and augmented with Gestalt and ADE20k class features. To further increase segmentation context, we fuse the point features with additional Gestalt feature encodings which are obtained by projecting the points into latent feature maps produced by a frozen autoencoder. Learned query embeddings are then decoded directly into 3D wireframe edges via cross-attention mechanisms. Evaluated on the “HoHo 22k” dataset, our approach significantly outperforms both handcrafted and learned baselines, achieving a Hybrid Structure Score (HSS) of 0.6476 and securing the second-highest position on the challenge’s private leaderboard.
[CV-33] Explainable Forensics of Manipulated Segments in Untrimmed Long Videos ICML2026
链接: https://arxiv.org/abs/2606.02402
作者: Yue Feng,Jingjing Li,Qijia Lu,Wei Ji,Jingrou Zhang,Fei Shen,Xiao Li,Yizhen Jia,Qiang Chen,Limin Wang,Wentong Li,Jie Qin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:The rapid advancement of AI-driven video generation has transformed content creation, while simultaneously increasing the risk of misinformation through localized manipulations in long-form videos. Existing video forensic methods predominantly operate on short, independent clips, and thus fail to capture realistic scenarios where AI-generated content is sparsely embedded within otherwise authentic footage. To bridge this gap, we formulate the task of Temporal AI-Generated Segment Localization and Explanation, which targets authenticity detection, temporal localization, and interpretable analysis of manipulated segments in untrimmed long videos. We further introduce TASLE, a large-scale benchmark comprising 12,472 untrimmed videos with diverse manipulation patterns and rich annotation signals, including temporal boundaries, authenticity labels, and segment-level rationales. In addition, we propose MSLoc, a coarse-to-fine forensic baseline that combines a boundary-sensitive proposal generation module for efficient long-video scanning with an MLLM-based refinement module for precise boundary localization and interpretable reasoning. Experiments validate the effectiveness of the proposed baseline, highlighting the importance of segment-level explainable forensics for long-form AI-generated video analysis. Our dataset and code are publicly available at this https URL.
[CV-34] Honey I Shrunk the Arc de Triomphe!
链接: https://arxiv.org/abs/2606.02379
作者: Yuanbo Xiangli,Hanyu Chen,Xueqing Tsang,Noah Snavely
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Metric scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent ‘‘scale-collapse’’ phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogenous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically-grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.
[CV-35] PRIMA: Boosting Animal Mesh Recovery with Biological Priors and Test-Time Adaptation
链接: https://arxiv.org/abs/2606.02366
作者: Xiaohang Yu,Ti Wang,Mackenzie Weygandt Mathis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present PRIMA (PRIors for Mesh Adaptation), a framework for robust 3D quadruped mesh recovery under severe species and pose imbalance. Existing animal reconstruction methods often regress toward mean shapes and poses due to limited 3D supervision and long-tailed species distributions, resulting in poor generalization to underrepresented animals and rare articulations. PRIMA addresses this challenge through three key contributions. First, we incorporate BioCLIP embeddings as biological priors to inject semantic and morphological knowledge into the reconstruction process, enabling more accurate and generalizable shape prediction across diverse quadrupeds. Second, we introduce a test-time adaptation (TTA) strategy that refines SMAL predictions using 2D reprojection constraints together with auxiliary keypoint guidance, improving pose and shape estimation while enabling the generation of high-quality pseudo-3D annotations from existing 2D datasets. Third, leveraging this TTA framework, we construct Quadruped3D, a large-scale pseudo-3D dataset that covers diverse species and pose variations to systematically improve model performance. Extensive experiments on Animal3D, CtrlAni3D, Quadruped2D, and Animal Kingdom demonstrate that PRIMA achieves state-of-the-art results, with particularly strong improvements on underrepresented species and challenging poses. Our results highlight the importance of biological priors and adaptation-driven data expansion for scalable and generalizable animal mesh recovery. Code is available at this https URL.
[CV-36] Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains
链接: https://arxiv.org/abs/2606.02357
作者: Garvin Guo,Donglei Yu,Yu Chen,Xiang Wang,Shuai Li,Xinpei Zhao,Huaxing Liu,Qinghao Wang,Minpeng Liao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-augmented multimodal agents show strong benchmark gains, often taken as evidence that agents have learned to use tools. We argue that this interpretation can be premature: a tool-call trace alone does not show whether the tool supplied answer-critical information. We study two representative ``thinking with images’’ agents, Thyme and DeepEyesV2, across real-world understanding, OCR, chart understanding, and mathematical reasoning. Each agent is compared with its Tool-Free counterpart and with a Pure-Text Reasoner trained from the same source pool without tool-calling trajectories. Tool access yields little consistent aggregate improvement, does not reliably reduce generated-token cost, and leaves only a small tool-only solved set: 93% of DeepEyesV2’s tool-solved problems and 96% of Thyme’s are also solved by at least one non-tool setting. Mechanism ablations further show that the full tool-use loop does not consistently outperform either the tool-call format or the returned execution result alone. In the settings we study, the analyzed agents appear to learn tool-calling patterns more reliably than tool-contributed capabilities, suggesting that evaluation should distinguish tool availability from whether tools actually expand what agents can solve.
[CV-37] Multi-modal Video Representation Alignment for Robust Self-supervised Driver Distraction Detection ITSC2026
链接: https://arxiv.org/abs/2606.02352
作者: David J. Lerch,Livien Majer,Zeyun Zhong,Manuel Martin,Frederik Diederichs,Rainer Stiefelhagen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the IEEE ITSC 2026
Abstract:Robust self-supervised learning of multi-modal video representations is critical for real-world applications such as driver distraction detection, where multiple sensors provide complementary but noisy signals. Conventional contrastive objectives, such as InfoNCE, assume all negatives are equally informative and all positives are reliable. However, this assumption is frequently violated in multi-modal data due to viewpoint changes, occlusions, or semantic overlap across modalities. In this work, we propose a novel framework for multi-modal global alignment that addresses these challenges by jointly modeling faulty negatives and unreliable or faulty positives. We introduce soft targets derived from cycle-consistency scores to relax the hard-negative assumption, and a weighting mechanism based on similarity distributions to mitigate the impact of noisy or faulty positives. Our approach extends traditional pairwise alignment to a principled global multi-modal setting, aggregating alignment information across all modality pairs. We evaluate our method on the DriveAct dataset, demonstrating that it consistently outperforms both pairwise and existing global alignment baselines across RGB, IR, Depth, and Skeleton modalities. Cross-view ablation studies further show strong generalization to unseen camera perspectives, highlighting the robustness of our representations. Overall, our framework provides a scalable and effective solution for self-supervised global multi-modal representation learning, enabling reliable driver distraction detection and pioneering in real-world multi-modal video understanding. Our code will be published on GitHub.
[CV-38] ROPHIES: Temporal Reconstruction of Places Humans and Cameras from Multi-view Videos
链接: https://arxiv.org/abs/2606.02350
作者: Jinpeng Liu,Yukang Xu,Yutong Li,Xingyu Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing humans and their surrounding environments in a globally consistent 4D space is essential for comprehensive perception. However, prior works typically assume single-view inputs or decouple humans, scenes, and cameras, making them unable to recover coherent geometry, stable motion, and physically aligned trajectories. These limitations motivate us to introduce a new task: unified human-scene-camera reconstruction from multi-view videos, which aims to jointly estimate dynamic humans, static scenes, and camera poses in one global coordinate frame. We propose TROPHIES–Temporal Reconstruction of Places, Humans, and Cameras from Multi-view Videos-a unified framework tailored for this task. TROPHIES features a Human Branch that models humans through temporal and spatial reasoning, and a Scene Branch that reconstructs static geometry with human-aware attention. A global alignment and optimization module couples both branches by enforcing scale consistency, contact priors, and cross-view temporal coherence. Experiments on EgoHuman and EgoExo4D demonstrate that TROPHIES achieves globally aligned, physically plausible 4D reconstructions and consistently outperforms existing paradigms in both global fidelity and human-scene consistency.
[CV-39] VEDAL: Variational Error-Driven Asynchronous Learning for 3D Gaussian Splatting Pruning
链接: https://arxiv.org/abs/2606.02346
作者: Aoduo Li,Jiancheng Li,Huan Ye,Hongjian Xu,Shiting Wu,Xiujun Zhang,Zimeng Li,Xuhang Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 5 figures. Accepted by CGI 2026
Abstract:3D Gaussian Splatting (3DGS) achieves remarkable novel view synthesis quality with real-time rendering, yet suffers from excessive memory consumption due to millions of Gaussian primitives. Existing pruning methods rely on heuristic importance scores or synchronous batch updates, leading to suboptimal compression and training instability. We propose VEDAL, a principled framework that formulates Gaussian pruning as variational free energy minimization. Our approach introduces (1) a prediction-error gating mechanism that asynchronously activates pruning based on per-Gaussian reconstruction uncertainty, and (2) a variational uncertainty head that models pruning decisions as latent variables with learnable priors. The free energy objective naturally balances reconstruction fidelity against model complexity through an information-theoretic lens. Extensive experiments on Mip-NeRF 360, TanksTemples, and Deep Blending demonstrate that VEDAL achieves 5.2x compression with only 0.31 dB PSNR drop, outperforming PUP 3D-GS by +0.05 dB at a higher compression ratio and LightGaussian by +0.35 dB at comparable quality, while maintaining real-time rendering at 185 FPS.
[CV-40] Detecting Pen-In-Air States from Video: A Proof-of-Concept Toward Complementary Handwriting Analysis
链接: https://arxiv.org/abs/2606.02342
作者: Lauren Sismeiro,Remy Plastre,Binbin Xu,Frederic Puyjarinet,Gerard Dray
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted for 12th International Conference on Computer Technology Applications (ICCTA 2026)
Abstract:Dynamic aspects of handwriting are critical for assessing developmental disorders such as dysgraphia and are typically captured using digitizing tablets. However, tablet-based sensing restricts analysis of Pen-Up behavior to a short proximity range above the writing surface, potentially missing high-lift in-air movements. As a proof of concept, we investigate whether top-view video can provide a complementary source of information for inferring pen-contact states without relying on tablet proximity sensing. We propose an interpretable hybrid pipeline combining pen-tip tracking using a YOLO-based detector with kinematic feature extraction and machine learning classification. A pilot dataset of diverse handwriting videos was manually annotated at the frame level and evaluation used a Leave-One-Video-Out (LOVO) protocol. The method achieved reliable event-level detection of Pen-Up segments, with an F_2 score up to 0.805, consistent with the emphasis on recall in a screening-oriented setting. These results support the feasibility of video-based Pen-Up detection as a low-cost and non-intrusive complement to digitizing tablets, and provide a foundation for future large-scale studies.
[CV-41] Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging
链接: https://arxiv.org/abs/2606.02339
作者: Tim Nielen,Sameer Ambekar,Johannes Kiechle,Daniel M. Lang,Julia A. Schnabel
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model’s representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, referred to as prediction bias. Prediction bias refers to a shift in the predicted class distribution, with some classes overrepresented and others suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. Next, to demonstrate the significance of prediction bias and mitigate it, we further propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test-time.
[CV-42] Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates
链接: https://arxiv.org/abs/2606.02331
作者: Pengfei Jin,Yiqi Tian,Kailong Fan,Bingjie Qi,Quanzheng Li
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Diffusion-based inverse problem solvers can produce realistic reconstructions, but realism alone does not ensure that the recovered details are supported by the measurement. We study this failure as measurement-conditioned hallucination: visually meaningful content that is either implausible or inconsistent with the measured instance. Our analysis separates Bayes-rule-based diffusion inverse solvers into a prior update and a measurement-conditioning step, showing that hallucinated content can enter through the prior-side proposal before the measurement correction is applied. Motivated by this view, we propose Robust Prior Update (RPU), a solver-level module that probes the local stability of the diffusion prior update, re-anchors the resulting displacement at the current iterate, and leaves the measurement update unchanged. We instantiate RPU in DPS and evaluate it on FFHQ and ImageNet inverse problems using automatic metrics and human faithfulness studies. On FFHQ, RPU improves PSNR and LPIPS over DPS across box inpainting, Gaussian deblurring, and motion deblurring. In human judgments, RPU receives 91.9% of blind non-tie majority preferences and 91.1% of ground-truth-assisted non-tie preferences on FFHQ box inpainting, while the ImageNet Gaussian reader study is tie-heavy but favors RPU among non-tie cases. These results support a targeted claim: robustifying the prior update can improve instance faithfulness in diffusion inverse solvers, especially when the prior shapes weakly constrained content.
[CV-43] raining-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning CVPR2026
链接: https://arxiv.org/abs/2606.02321
作者: Yang Liu,Qianqian Xu,Peisong Wen,Siran Dai,Qingming Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026, VidLLMs workshop
Abstract:Recent advances in large vision-language models have expanded video retrieval from simple text-based search to more flexible scenarios, where users may specify the desired result through both visual examples and textual instructions. In the CVPR 2026 Reason-Aware Composed Video Retrieval Challenge, the system is required to retrieve a target video according to a reference video and a modification instruction. To address this task, we develop Visual Representation-Guided Video-LLM Reasoning for Training-Free Composed Video Retrieval. Our framework first uses frozen DINOv3 models to obtain a compact set of visually relevant candidates, and then applies large vision-language models to evaluate whether each candidate satisfies the modification instruction. A final reasoning-based refinement is further performed on the top candidates to improve the first-ranked prediction. Without training, our system achieves 48.78 Recall@1 and 51.48 Recall@5 on the test set. Future work may further improve retrieval accuracy through stronger video-LLMs and detailed integration between visual representations and language reasoning.
[CV-44] Deep Learning for Remote Sensing to Improve Flood Inundation Mapping
链接: https://arxiv.org/abs/2606.02310
作者: Yogesh Bhattarai,Vijay Chaudhary,Wai Lim Kim,Sanjib Sharma
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper has been selected as the top 10 student finalists in IGRASS 2026 paper competition
Abstract:Flooding is the most pervasive natural disaster worldwide. Timely and accurate flood inundation mapping are essential for informing disaster risk management. Optical satellite missions provide high-resolution, multispectral observations critical for flood detection and inundation mapping. However, their operational utility is severely constrained by cloud cover during extreme precipitation events. Conventional cloud-removal techniques based on temporal compositing or interpolation often fail to capture inundation dynamics. In this study, we introduce a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models, leveraging the Masked Diffusion Transformer architecture. The proposed approach exploits self-attention mechanisms to capture wider spatial context and employs masked token modeling to explicitly learn the reconstruction of cloud-obscured regions. Trained on multispectral Sentinel-2B flood scenes with realistic cloud patterns, the model generates cloud-free image realizations that preserve both visual fidelity and hydrological consistency. Reconstruction performance is evaluated using standard image quality metrics alongside flood-specific hydrological measures, demonstrating improved continuity of water bodies and preservation of spectral signatures critical for water detection indices. The results indicate that diffusion-based generative modeling offers a robust and physically consistent alternative for cloud removal in optical flood monitoring, enabling more reliable, continuous observations to support disaster risk management and flood-related decision making.
[CV-45] Measurement Geometry and Design for Trustworthy Generative Inverse Problems
链接: https://arxiv.org/abs/2606.02309
作者: Pengfei Jin,Na Li,Quanzheng Li
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models are increasingly used as priors for inverse problems, but their ability to produce realistic images creates a basic trust problem: a plausible reconstruction may be supported by the measurements, or it may be filled in by the prior along unobserved directions. This distinction is especially important in medical imaging, where acquisition operators are designed under scan-time, dose, and calibration constraints. We study generative inverse problems from a measurement-geometry perspective. The central question is whether a fixed measurement operator can distinguish nearby images that are plausible under the generative prior, and whether this relationship can guide better measurements. We introduce a local measurement-manifold compatibility measure that quantifies how well the operator observes prior-relevant tangent directions. Under local regularity assumptions, we prove that this quantity controls the stable part of the reconstruction error, while the generative prior controls off-manifold drift. This worst-direction certificate motivates practical fixed and sequential acquisition rules based on overall local volume preservation, including a posterior-cloud design that adapts measurements at test time without training a sampling policy. Across row-sampling, tomographic, and MR acquisition settings, the proposed scores predict failure modes, explain measurement-induced hallucinations, and guide better sampling. In fastMRI Cartesian sampling, posterior-cloud measurement design improves over strong non-learned ACS-preserving baselines, including variable-density and Poisson-like masks.
[CV-46] Cross-Domain Dead Tree Detection via Knowledge Distillation in Aerial Imagery
链接: https://arxiv.org/abs/2606.02303
作者: Anis Ur Rahman,Mete Ahishali,Einari Heinaro,Samuli Junttila
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 6 figures, journal
Abstract:Detecting dead trees in aerial imagery is vital for assessing forest health, especially as tree mortality increases globally due to climate change, but domain variability and scarce labeled data often limit model generalization. This study advances the TreeMort-1T-UNet (Tree Mortality 1-Task U-Net) model, initially trained on Finnish aerial imagery (source domain), by applying knowledge distillation (KD) to adapt it to various target domains, including Polish, German, and Estonian datasets representing diverse forest types. We assess four KD variants: Basic, Self, Feature-level, and Ensemble, against a fine-tuning baseline, using Mean Tree IoU, Instance F1-score, Instance Precision, and Mean Centroid Error as key metrics, alongside representational analyses (e.g., cosine similarity, CKA, SSIM, t-SNE, and linear probing) for domain invariance. Feature-level KD outperforms others, yielding a Mean Tree IoU of 0.106, Instance F1-score of 0.63, Instance Precision of 0.55, and Mean Centroid Error of 3.039 on the Polish dataset, with robust precision across other target domains (e.g., 0.15 on Finnish, 0.67 on Polish, 0.60 on German, 0.59 on Estonian). It excels in low-data scenarios with fewer false positives and shows superior representational invariance (e.g., higher deep-layer CKA/SSIM, better domain mixing in t-SNE, and linear probing AUC of 0.95), making it ideal for precision-critical forestry applications. Additional ablation studies confirm that key components like feature alignment enhance its performance balance across metrics. Our findings demonstrate KD’s potential to enhance transfer learning in remote sensing, offering a scalable, domain-robust tool for ecological monitoring and sustainable forest management.
[CV-47] Neural Acquisition Representation of Subsurface Scattering
链接: https://arxiv.org/abs/2606.02292
作者: Arjun Majumdar,Raphael Braun,Hendrik Lensch
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:We present a method to acquire and estimate the sub-surface scattering properties of light transport at a highly detailed level by learning the pixel footprint response at each point on the object surface. The reconstruction leverages 3D scanning techniques as input to a U-Net CNN. A stereo projector-camera setup using phase-shifted profilometry (PSP) patterns efficiently captures the data for a variety of scattering objects. Reconstructing dense pixel footprints allows for relighting with arbitrary high-resolution projector patterns. The final output is a relit color image. Qualitative and quantitative comparison against illuminated real-world captured images demonstrate that the predicted footprints are almost identical to the actual responses. The same model is trained for multiple views across multiple objects such that the learned representations can be used to generalize to unseen sub-surface scattering materials as well.
[CV-48] Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset ITSC2026
链接: https://arxiv.org/abs/2606.02273
作者: David J. Lerch,Sarath Mulugurthi,Manuel Martin,Frederik Diederichs,Rainer Stiefelhagen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE ITSC 2026
Abstract:Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing visionlanguage models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the DriveAct dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled DriveAct dataset we create a new DriveAct description dataset containing finegrained descriptions to train VLMs on driver activity understanding. Cross dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our new DriveAct description dataset generalizes well to actions in the DMD dataset. The VLM fine-tuned on our DriveAct description dataset achieves an ACCR score of 76 outperforming the zero-shot VLM baseline with an ACCR score of 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our DriveAct description dataset and code will be publicly available on GitHub.
[CV-49] From Extrinsic to Intrinsic: Geodesic-Guided Representation Learning for 3D Geometric Data
链接: https://arxiv.org/abs/2606.02268
作者: Yuming Zhao,Junhui Hou,Qijian Zhang,Jia Qin,Ying He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Geometric analysis fundamentally distinguishes between \textitextrinsic and \textitintrinsic perspectives. The dominant paradigm in current 3D representation learning relies on either extrinsic spatial structures or high-level semantics, struggling to capture the essence of shape identity and underlying manifold topology. To bridge this gap, we introduce a novel 3D representation learning paradigm, namely \textbfPRISM, for \textbfPre-training, which learns isometric embeddings by \textbfRecovering the \textbfIntrinsic \textbfSurface geodesic \textbfMetric. PRISM incorporates a topology-enforcing objective that explicitly constrains the structure of latent space, alongside a specialized two-stage training recipe mitigating sample imbalance inherent in the distribution of geodesic distances. Experiments demonstrate that our approach shows satisfactory accuracy, robustness, and high efficiency in geodesic distance prediction and achieves superior performance across diverse downstream tasks, including shape recognition, surface parameterization, and non-rigid correspondence. The code will be publicly available at this https URL.
[CV-50] A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs
链接: https://arxiv.org/abs/2606.02267
作者: Nicolas Stalder,Benjamin F. Grewe,Matteo Saponati,Pau Vilimelis Aceituno
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Main: 8 pages, 3 figures, 2 Tables. Supplement: 10 pages, 7 figures, 6 Tables
Abstract:The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined. Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only \sim 35% of the training FLOPs, using a model with \sim 50% less parametets, trained with \sim 33% of the epochs and \sim 15% the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2-8x less total compute across 3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.
[CV-51] Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark
链接: https://arxiv.org/abs/2606.02246
作者: Maria Santos-Villafranca,Jesus Bermudez-cameo,Alejandro Perez-Yus,Giovanni Maria Farinella,Antonino Furnari
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:To operate in the physical world, embodied agents must perceive their environment in an “always-on” fashion, selectively accessing the most informative sensors to balance energy constraints and task accuracy. Despite its importance for resource-constrained devices, energy-aware perception remains under-explored, with most prior work assuming unlimited compute. To address this, we introduce Ego-METAS: the first Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark. Ego-METAS provides a unified testbed of more than 100 hours of untrimmed egocentric video from EgoExo4D, CMU-MMAC, and CaptainCook4D, spanning 5 modalities (RGB, audio, gaze, IMU, and monochrome camera). We formulate an online temporal action segmentation task where models must dynamically select which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets. Alongside the benchmark, we release unified splits, cleaned annotations, pre-extracted features, and a diverse suite of baseline routing policies. Our evaluations show that optimal routing is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. However, even simple dynamic fusion of complementary modalities (e.g., via random routing) proves critical for balancing predictive accuracy against strict energy budgets. Ultimately, Ego-METAS provides a standardized foundation to develop robust, cost-aware policies for autonomous, always-on embodied AI.
[CV-52] owards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification
链接: https://arxiv.org/abs/2606.02242
作者: Karina Kvanchiani,Timur Mamedov
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observed that I2I ReID pre-training positively impacts the generalization ability to T2I data. Besides, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.
[CV-53] Chroma Clues: Leverag ing Color Statistics to Detect Synthetic Images
链接: https://arxiv.org/abs/2606.02224
作者: Lea Uhlenbrock,Davide Cozzolino,Christian Riess
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The evolution and dissemination of AI-synthesized images is occurring at an unprecedented rate. Image generators are making rapid progress in their goal of perfectly imitating natural images, which also challenges image forensics. In this work, we exploit an underexplored cue in current generative models, namely their weakness to imitate color statistics of natural images. We first show that the LPIPS loss used for training image generators is less sensitive to chrominance than to luminance, which may lead to statistical discrepancies in the colors of synthetic images. Building on this observation, we then introduce six hand-crafted color transformations and a method to learn a task-optimized color transform to statistically expose generated images. These transformations can be used in various ways. First, we define color-sensitive features at pixel-level or patch-level. A simple, interpretable classifier achieves with these features an average generalization accuracy of 93.27% and strong robustness against six types of post-processing. Second, we demonstrate that the transformations exhibit characteristic visual noise patterns in natural and synthetic image areas, which enables an intuitive visual image evaluation. Third, we demonstrate that the transforms can enhance color patterns in generated images for improved multiclass attribution. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.02224 [cs.CV] (or arXiv:2606.02224v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.02224 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-54] CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations ICML2026
链接: https://arxiv.org/abs/2606.02221
作者: Chengfeng Wu,Tao Zou,Yanru Wu,Jingge Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by ICML 2026
Abstract:Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approaches remain agnostic to the content of the shared representation, they fail to disentangle task-relevant structure from spurious context, leading to negative transfer and poor generalization. To overcome this limitation, we propose Causal Orthogonal Representations for Multi-Task Learning (CORE-MTL), a causally motivated representation-centric framework that encourages a structured semantic-residual factorization of the shared representation, concentrating task-relevant structure in the semantic stream while relegating nuisance variation to the residual stream. We instantiate this framework in the visual domain by leveraging physical priors for structured scenes and statistical constraints for attributes. Theoretically, our method enjoys a tighter out-of-distribution generalization bound than optimization-centric methods and reduces task gradient interference without explicit gradient projection or reweighting. Empirically, CORE-MTL consistently outperforms existing methods on visual multi-task benchmarks in both in-distribution and out-of-distribution settings. Code is publicly available at this https URL.
[CV-55] Symmetry-Aware 9D Pose Estimation with Sim(3)-Consistent Feature and Spherical Inception Convolution
链接: https://arxiv.org/abs/2606.02219
作者: Panfei Cheng,Hongshan Yu,Wenrui Chen,Xiaojun Tang,Jian Liu,Naveed Akhtar
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures
Abstract:Object pose estimation is a fundamental problem for an agent system to perceive or manipulate objects in images or videos. However, current instance-level methods struggle with generalization to unseen objects. Category-level methods seek to address this, but remain constrained by the complexities of learning in the non-linear Sim(3) space and intra-class variations. To address these challenges, We propose an effective method for category-level object pose estimation with two key innovations: (1) A translation/size estimator, featuring a semantic-guided symmetry-aware module that leverages robust generalization capabilities of a large vision model (LVM) to infer symmetry points, resulting in accurate translation and size without shape priors. This result serves as a precomputed cue for rotation estimation, thereby reducing the difficulty of learning in the non-linear Sim(3) space and laying a robust foundation for tackling the inherently more challenging rotation estimation. (2) A feature fusion module, based on our proposed spherical large-kernel inception convolution, fuses semantic features from the LVM with systematically computed geometric features to extract essential pose features from intra-class variations by modeling long-range dependencies without excessive computational cost. Built on these innovations, we achieve SOTA on benchmarks and real-world scenes, while developing a robust robotic picking system capable of handling diverse objects. Our code will be available at the project page: \hypersetupurlcolor=bluethis https URL.
[CV-56] Order within Chaos: Capturing Intrinsic Energy Anomalies for AI-Manipulated Image Forgery Localization ICML2026
链接: https://arxiv.org/abs/2606.02178
作者: Yiming Wang,Baiqi Wu,Qingming Li,Jiahao Chen,Tong Zhang,Shouling Ji
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Recent advancements in generative AI have led to image editing models capable of producing realistic forgeries that evade traditional image forgery localization methods, as these approaches depend on physical noise absent in synthetic data. To address this challenge, we theoretically demonstrate that the diffusion process inherently suppresses local high-frequency variance, creating a statistical energy gap that is distinguishable from the natural entropy of optical imaging. Guided by this insight, we propose FLAME, a unified framework that utilizes a LAD map to capture these intrinsic anomalies, coupled with a parameter-efficient adapter for SAM to achieve precise, pixel-level forgery localization. Furthermore, to bridge the lag between forensic benchmarks and evolving generative models, we introduce EditStream, an automated pipeline for continuous, instruction-based training data synthesis. Extensive experiments demonstrate that FLAME establishes a new state-of-the-art, significantly outperforming previous methods on AI-generated forgery datasets while effectively generalizing to unseen generative architectures. Our code is available at this https URL.
[CV-57] Closing the Alignment-Maturity Gap in Federated Prototype Learning
链接: https://arxiv.org/abs/2606.02172
作者: Mario Casado-Diez,Alejandro Dopico-Castro,Verónica Bolón-Canedo,Bertha Guijarro-Berdiñas
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Learning discriminative visual representations from distributed, heterogeneous data is a fundamental challenge in Federated Learning (FL). Prototype-based methods address statistical heterogeneity by sharing class-level representations across clients but create a distance-dependent gradient pressure that is particularly severe during early training rounds: alignment pressure applied to immature global prototypes, aggregated from noisy local representations, generates large gradients that suppress the emergence of local discriminative structure. The result is a poorly organized embedding space and degraded recognition performance, particularly under severe non-IID conditions. We propose FedSAP, a framework that stabilises federated representation learning through two complementary mechanisms: a deterministic alignment curriculum that delays global alignment until local representations become stable and a geometry-driven proxy separation loss that enforces inter-class structure on the unit hypersphere using the existing prototype bank without introducing additional parameters or communication overhead. Together, these mechanisms produce compact, well-separated class clusters without altering the underlying communication protocol between federation’s participants. Experiments across three benchmarks and varying degrees of heterogeneity show gains of up to 4 percentage points over the prototype-based baselines evaluated, with improvements most pronounced under high heterogeneity. The representational nature of our framework further enables a straightforward extension to semi-supervised settings, where unlabelled data is incorporated with minimal modification, underscoring the generality of scheduled alignment as a design principle.
[CV-58] InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark
链接: https://arxiv.org/abs/2606.02171
作者: Shiyu Wang,Ziyu Liu,Chaoyi Yu,Yujie Yin,Zhongqian Mao,Jing Chen,Jiaqi Song,Yunshi Lan,Yan Wang(East China Normal University, Shanghai, China)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 22 figures
Abstract:Visual emotion understanding requires models not only to recognize emotional states, but also to why they arise and perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce \textbfInsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Building from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from visual trigger extraction through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. In total, InsightVQA contains 725K QA pairs. We further present \textbfInsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we introduce \textbfInsightNet, an emotion-tuned baseline for MLLMs. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.
[CV-59] Disentanglement-Based Equivariant Learning for Compositional VQA
链接: https://arxiv.org/abs/2606.02168
作者: Zhou Du,Zhaoquan Yuan,Xiao Wu,Changsheng Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by IEEE Transactions on Multimedia
Abstract:Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.
[CV-60] Ultra Diffusion Poser: Diffusion-Based Human Motion Tracking From Sparse Inertial Sensors and Ranging-Based Between-Sensor Distances CVPR2026
链接: https://arxiv.org/abs/2606.02153
作者: Dominik Hollidt,Tommaso Bendinelli,Christian Holz
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: CVPR 2026 - Computer Vision and Pattern Recognition
Abstract:Methods using inertial measurement units (IMUs) provide a wearable alternative to camera-based motion capture. To mitigate drift from inertial signals, recent sparse inertial pose estimators integrate inter-sensor distances measured by ultra-wideband (UWB) ranging. So far, UWB distances have only been used as an additional input feature, ignoring the physical constraints they impose on sensor positions. However, these distances can also be used to reconstruct the underlying 3D sensor layout, which in turn provides more informative input for pose reconstruction. We propose Ultra Diffusion Poser, a diffusion model that explicitly models these geometric constraints. It includes a Spatial Layout Module that analytically reconstructs the 3D sensor positions from UWB measurements. These sensor positions are used alongside IMU signals and UWB distances as a conditioning signal during diffusion. Still, network predictions can violate inter-sensor distance measurements. To address this, we introduce UWB-Diffusion Guidance, which encourages alignment between predicted poses and measured distances during diffusion sampling. Together, these contributions enable our model to achieve state-of-the-art performance, reducing joint position error by up to 22% over prior work.
[CV-61] Rethinking Evaluation Paradigms in IBP-based Certified Training ICML2026
链接: https://arxiv.org/abs/2606.02134
作者: Konstantin Kaulen,Hadar Shavit,Holger H. Hoos
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural–certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.
[CV-62] Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization
链接: https://arxiv.org/abs/2606.02129
作者: Liyuan Ma,Xueji Fang,Guo-Jun Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image customization learns target subjects from reference concept images and generates conditioned images per text prompts, mainly modifying styles or backgrounds. Prevailing methods adopt fine-tuning to pack diverse concept attributes into a unified latent embedding, yet entangled attributes hinder elimination of irrelevant disturbances from style and background. To address this issue, we propose Equilibrated Diffusion, a frequency-driven approach that disentangles tangled concept features for balanced customization and consistent text-visual matching. Unlike conventional methods learning full concepts with shared embeddings and unified tuning, our work utilizes the inherent link between image frequency components and semantics: low frequencies represent subject content and high frequencies correspond to styles. We decompose concepts in frequency space and optimize each embedding independently. This separate optimization enables the denoiser to capture style detached from subject identity and generalize better to unseen stylistic prompts. Merging multi-frequency embeddings preserves the model’s original spatial customization ability. We further deploy mask-guided diffusion to restrict irrelevant background changes and boost text alignment. Residual Reference Attention (RRA) is inserted into spatial attention to retain subject structure and identity consistency. Experiments prove Equilibrated Diffusion exceeds mainstream baselines on subject fidelity and text adherence, verifying our method’s superiority.
[CV-63] Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection
链接: https://arxiv.org/abs/2606.02120
作者: Boyu Han,Qianqian Xu,Shilong Bao,Zhiyong Yang,Ruochen Cui,Qingming Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.
[CV-64] Multimodal Action Diffusion for Robust End-to-End Autonomous Driving
链接: https://arxiv.org/abs/2606.02105
作者: Jorge Daniel Rodríguez-Vidal,Diego Porres,Gabriel Villalonga Pineda,Antonio M. López Peña
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. June 1st, 2026. Corresponding author: Jorge Daniel Rodríguez-Vidal
Abstract:End-to-End Autonomous Driving (E2E-AD) systems have largely converged on predicting intermediate trajectory waypoints, delegating final control to hand-crafted controllers with GPS access. Direct control-signal prediction (outputting throttle, steer and brake in an end-to-end fashion) remains underexplored, and critically, the role of action multimodality in such systems is not well understood. We argue that moving beyond deterministic, single-action outputs is not merely a modelling choice, but a key driver of driving performance, representational quality, and training stability. To validate this, we introduce the Action Diffusion Transformer (ADT), an anchor-free diffusion transformer trained with a MSE objective that natively models the multimodal distribution of plausible driving actions. Rather than committing to a single deterministic command, ADT generates K action candidates and selects the most suitable one at inference via Nearest Neighbour Matching (NNM). Beyond strong benchmark numbers, we show that action multimodality yields measurable benefits in learned representations and behavioral consistency, effects that deterministic architectures cannot replicate. ADT surpasses previous state-of-the-art on the challenging closed-loop Bench2Drive benchmark while achieving ten times lower latency, demonstrating that expressive, multimodal action modelling is both practically efficient and conceptually essential for robust end-to-end driving.
[CV-65] WebSpline: Structure-Informed Splines for Real-Time 3D Gaussians from Monocular Videos
链接: https://arxiv.org/abs/2606.02096
作者: Jongmin Park,Jeonghwan Yun,Minh-Quan Viet Bui,Munchurl Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first two authors contributed equally to this work (equal contribution). Please visit our project page at this https URL
Abstract:Dynamic scene reconstruction from monocular videos remains highly challenging, as existing methods often struggle to balance global structural coherence and local fine-grained details under limited multi-view cues. To address this challenge, we propose WebSpline, a novel dynamic 3D Gaussian framework that enables structurally coherent and high-fidelity reconstruction from monocular videos with fast rendering. The core of WebSpline is the Structure-Informed Spline (SIS) representation, which models each dynamic Gaussian trajectory using a learnable cubic Hermite spline whose motion is structurally organized with an auxiliary Structural Proxy Graph (SPG). The proposed framework is optimized in two stages: (i) in the first stage, the SPG is initialized from 2D point tracks and refined with temporal rigidity regularization to establish structural coherence for moving objects across the sequence; and (ii) in the second stage, the SIS representation is initialized from the refined SPG and optimized under both spatial and structural neighborhood constraints. At inference, Gaussian motion is obtained solely by evaluating the learned SIS, enabling fast rendering. Extensive experiments on the challenging monocular dynamic scene benchmarks, iPhone and NVIDIA, demonstrate that our WebSpline achieves state-of-the-art rendering quality while rendering over 10 times faster than WorldTree, the second-best method on the iPhone dataset.
[CV-66] FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation
链接: https://arxiv.org/abs/2606.02090
作者: Xueji Fang,Liyuan Ma,Jianhao Zeng,Jinjin Cao,Mingyuan Zhou,Guo-Jun Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion transformer (DiT) has been widely adopted in the generative diffusion field, advancing the denoising of query tokens through attention and Feed-Forward (\textFFN) layers. FFN actually acts as the key-value vocabulary for decoding visual contents where the value embeds the visual semantical knowledge. We present that focusing on critical query tokens corresponding to more complex details and encouraging the model to improve these tokens is essential for fine-grained visual generation. To this end, we propose FocusDiT, which applies a Masking scheme to focus on critical query tokens that are exclusively fed into FFN. The masked queries can retrieve visual tokens from the FFN vocabularies, and use them to decode their visual details. Extensive text-to-image experiments validate the effectiveness of token masking in enhancing generative performance.
[CV-67] FACT: A Simple and Efficient Framework for Active Finetuning
链接: https://arxiv.org/abs/2606.02079
作者: Wenshuai Xu,You Song,Yuzhuo Cui,Minjie Ren,Qingjie Liu,Zhenghui Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ACCEPTED for publication as a REGULAR paper in the IEEE Transactions on Image Processing (T-IP)
Abstract:The main goal of active finetuning is to improve a pretrained model’s performance on a specific task or domain by finetuning it with carefully selected informative or challenging data. Previous research has predominantly focused on the active aspect (i.e., data selection) while uniformly employing full finetuning for model adaptation, which inevitably distorts pretrained features due to distribution shift. This issue becomes particularly pronounced when the model size is large relative to the finetuning data quantity, leading to heightened overfitting risks. To address this critical gap, we formally outline the FiAF task that emphasizes systematic exploration of finetuning methodologies in active learning. We propose FACT, a three-phase hierarchical finetuning framework featuring both efficiency and simplicity, specifically designed for active finetuning scenarios. Our comprehensive experiments span: (1) Three major dataset categories encompassing classic (CIFAR10, CIFAR100, ImageNet-1k), imbalanced (CIFAR10-LT, CIFAR100-LT), and fine-grained (StanfordCars, FGVCAircraft) image classification datasets, each evaluated under 3-5 distinct sampling ratios; (2) Diverse pretrained architectures including Convolutional Neural Network (ConvNeXt), Vision Transformer (ViT), and Vision LSTM (ViL) networks; (3) A systematic investigation of frozen feature augmentation (FroFA) strategies. (4) A comprehensive and rigorous analysis of efficiency and generalizability. The results demonstrate significant improvements with strong generalization and robustness. Notably, under low sampling ratios, our framework achieves remarkable performance gains of over 20% on the ViT model for CIFAR10, CIFAR100, and ImageNet-1k benchmarks. This systematic approach establishes new state-of-the-art performance while maintaining parameter efficiency, proving particularly effective when labeled data is scarce.
[CV-68] Fast and Lightweight Novel View Synthesis with Differentiable Multiplane Image
链接: https://arxiv.org/abs/2606.02068
作者: Kaidi Zhang,Guanxu Zhu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, novel view synthesis has witnessed remarkable progress, with mainstream methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) delivering impressive results. However, these approaches often struggle to balance rendering speed and model size, and their optimization-based training can be highly time-consuming. Furthermore, they typically rely on dense observations, often failing to produce satisfactory results under sparse-view conditions. Although feed-forward reconstruction significantly reduces the optimization time of 3DGS, its pixel-aligned formulation generates millions of Gaussians from a single image, severely limiting its practical deployment on mobile devices. To address these limitations, we revisit the Multiplane Image(MPI) representation, which represents scenes using a compact set of planar layers for efficient novel view synthesis. Leveraging recent advances in visual foundation models, we utilize predicted point maps for reliable geometric initialization, followed by differentiable optimization. To address the issues of holes and artifacts in sparsely initialized MPI, we introduce one-step diffusion, which participates in both the differentiable optimization of MPI and the postprocessing of rendering results. Compared with a representative GS-based method, our approach is 30.7% faster and uses only 14.8% of its model size, while achieving competitive synthesis quality on front-view scenarios
[CV-69] IDES: Time-Derivative Event Simulation via Deformable Reconstruction
链接: https://arxiv.org/abs/2606.02058
作者: Christopher Thirgood,Dipon Kumar Ghosh,Simon Hadfield
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Event cameras emit asynchronous events in response to environmental appearance changes. The scarcity of real-world event datasets makes simulation essential. However, most simulators infer event timestamps from frame sequences, forcing many threshold crossings to share a small set of discrete times; a failure mode we term timestamp batching that worsens under fast motion and occlusion. We present TIDES, a continuous-time event simulator built on dynamic Gaussian splatting. Because TIDES operates on an explicit 3D scene representation with learnt geometry and motion, it can derive per-pixel intensity dynamics directly from the scene, rather than by differencing rendered frames. This enables accurate threshold-crossing prediction, including multiple crossings per rendering step, without temporal upsampling or frame interpolation. The same 3D scene model reveals where objects partially occlude one another; TIDES uses this to guide adaptive time stepping, concentrating computation only in regions where occlusion dynamics make simple models of brightness change unreliable. Finally, we model finite sensor bandwidth using a tile-level arbiter whose throughput, jitter, and event drops reproduce realistic sensor artifacts. Across paired RGB-event benchmarks, TIDES attains state-of-the-art event-stream fidelity. We also show that events simulated by TIDES transfer more effectively to real downstream tasks than competitors’. Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO) Cite as: arXiv:2606.02058 [cs.CV] (or arXiv:2606.02058v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.02058 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-70] opological texture analysis of microscopy images of dynamic casein gelation and its relation to rheological properties
链接: https://arxiv.org/abs/2606.02048
作者: Zahra Tabatabaei,Diana Soto Aguilar,Jose C. Bonilla,Mathias P. Clausen,Jon Sporring
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph)
备注:
Abstract:We propose a novel computational toolbox that integrates Topological Data Analysis (TDA), Differential Box Counting (DBC), Multifractal Partition (MFP), and Local Binary Patterns (LBP), applied to time-lapse super-resolution STED microscopy images of sodium caseinate gelation induced by glucono-delta-lactone (GDL) at 30 °C and 40 °C and two GDL concentrations (1.8% and 3.5% w/v). TDA tracked topological loops, closed ring-like structures reflecting protein network interconnectivity, via max-Betti-1 curves, which revealed a lag phase of dispersed aggregates, a sharp decay coinciding with network percolation and the rheologically observed sol-gel transition, and a post-gelation increase corresponding to network rearrangements. These topological transitions were corroborated by DBC and MFP as these methods were able to resolve changes in structural complexity and spatial heterogeneity. The toolbox was validated on simulated fractal images prior to experimental application. Together, these descriptors provided sensitivity to subtle microstructural transitions that bulk rheology captured as averaged bulk mechanical responses. This integrated approach provides a robust quantitative tool for characterizing complex microstructure in food and material science with evolving microstructural dynamics. Code is available at this https URL Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Biological Physics (physics.bio-ph) Cite as: arXiv:2606.02048 [cs.AI] (or arXiv:2606.02048v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.02048 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-71] Attention mechanisms and transfer learning for robust peach leaf damage classification under domain shift
链接: https://arxiv.org/abs/2606.02045
作者: Adrián Cánovas-Rodriguez,Miguel A. González-Illán,Maria Fernanda García-Cruz,Pedro Nortes Tortosa,José Salvador Rubio-Asensio,Miguel A. Zamora Izquierdo,Juan Antonio Martínez Navarro,Antonio F. Skarmeta
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial intelligence provides a practical framework for crop damage assessment from imagery data, supporting early decision-making in agricultural management. In peach orchards, climate change increases abiotic stress and biotic pressures, including pests and diseases, which often produce visually similar foliar symptoms. This overlap makes manual diagnosis difficult, especially across multiple fields with varying environmental conditions, highlighting the need for automated models with strong generalization ability. We propose an image-based classification approach for peach leaf damage detection. A benchmark dataset was created through manual annotation of publicly available images, consisting of 1,366 peach leaves across six damage categories. Several deep learning architectures were evaluated. EfficientNet models achieved the best results, with EfficientNetB0 reaching 92.9 percent accuracy, EfficientNetB3 achieving 91.5 percent, and EfficientNetB5 showing the strongest performance on minority classes. DenseNet121 reached 92.6 percent accuracy. The integration of the Convolutional Block Attention Module (CBAM) improved performance in several backbones, particularly EfficientNetB5 and InceptionV3, while showing limited or negative impact in others. The CBAM-enhanced EfficientNetB5 achieved the best overall accuracy of 93.3 percent. To evaluate robustness under realistic conditions, a local dataset of 180 images across four classes was collected, and transfer learning strategies were applied to address domain shift. Three fine-tuning strategies were tested. EfficientNetB3 combined with CBAM achieved the best performance in the local domain, reaching a 93 percent macro F1-score after transfer. Overall, attention-based models showed improved robustness for minority classes and better generalization across different field conditions. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.02045 [cs.CV] (or arXiv:2606.02045v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.02045 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-72] Normality-Preserving Continual Industrial Anomaly Detection via Orthogonal LoRA Banks
链接: https://arxiv.org/abs/2606.02042
作者: Weibai Fang,Haijun Che,Feiyang Ren,Qiancheng Lao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages,6 figures,Submitted to Advanced Engineering Informatics
Abstract:Continual industrial anomaly detection with diffusion models suffers from historical normality prior drift and catastrophic forgetting. Existing continual diffusion methods preserve previous knowledge through replay or constrained optimization, but they lack an explicit mechanism for isolating and protecting category-specific normality priors during sequential adaptation. Although low-rank adaptation provides modular residual updates, standard LoRA neither freezes historical normality subspaces nor prevents new adapters from interfering with previous ones. To address this issue, we propose a normality-preserving continual anomaly detection framework based on two modules: History Frozen Orthogonal LoRA Bank (HF-OLB) and Hierarchical Novelty Adaptive Bank Growth module (HNABG). HF-OLB freezes both the pre-trained U-Net backbone and the learned LoRA banks, and constrains new task-specific normality residuals to the orthogonal complement of historical LoRA subspaces. HNABG further allocates layer-dependent residual capacity and expands the bank only when the residual normality novelty exceeds the expressive capacity of existing banks. Extensive experiments on MVTec and VisA demonstrate the effectiveness of the proposed method. On the challenging VisA 2x6 setting, our method achieves 83.6/91.8 image and pixel level A-AUROC with 3.8/3.9 FM, improving pixel level A-AUROC over the state of the art by 3.2 points while reducing pixel level FM by 1.3. These results show that our method effectively preserves historical normality priors in long horizon continual category sequences.
[CV-73] Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association
链接: https://arxiv.org/abs/2606.02022
作者: Matvei Shelukhan,Timur Mamedov,Aleksandr Chukhrov,Karina Kvanchiani
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.
[CV-74] PerBite: A Curated Diagnostic Workflow for Bite-Aware Food Volume Estimation
链接: https://arxiv.org/abs/2606.02021
作者: Ahmad AlMughrabi,Farid Al-Areqi,David Fernández Gómez,Umair Haroon,Marc Bolaños,Ricardo Marques,Petia Radeva
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Can a visually plausible food mesh be trusted to estimate the volume of consumed food? \method investigates this question using selected paired before- and after-consumption states from the MetaFood CVPR 2026 Continuous 3D Reconstruction While Eating Challenge. The submitted workflow follows a curated reconstruction protocol: SAM~3 segments the food and plate regions; Hunyuan3D/SAM~3D generates a dimensionless food mesh; the plate diameter provides the metric scale; the plate geometry is removed in Blender; and the remaining mesh is hole-filled, made watertight, and integrated to estimate volume. MoGe-2 is used only as an auxiliary cue for initial dish-diameter estimation when direct plate measurement is uncertain; it is not the primary scale source for the reported challenge result. \method ranks first, with an average Chamfer distance of 8.31 across 34 meshes using rigid ICP without scale correction. On 17 before- and after-pairs, it achieves 33.87% state-level volume MAPE and zero monotonicity violations, while consumed-volume MAPE remains 53.74%. The results show that surface reconstruction, metric scale, controlled mesh cleanup, watertight volume integration, and physical depletion consistency should be evaluated separately for dietary assessment. Source code and evaluation scripts will be available at \hrefthis https URLthis http URL.
[CV-75] Distortion-Aware Fusion of Statistical and Vision-Language Features for Blind Image Quality Assessment
链接: https://arxiv.org/abs/2606.02002
作者: Bishr Omer Abdelrahman Adam,Xu Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Blind image quality assessment (BIQA) aims to predict perceived image quality without access to a reference image. Classical natural scene statistics (NSS) descriptors and modern vision-language model (VLM) embeddings address this problem from fundamentally different perspectives, yet whether combining them yields complementary benefits and how to weight their contributions per input image remains unexplored. We propose a distortion-aware fusion framework that integrates a 138-dimensional NSS descriptor with two complementary VLM embeddings, SigLIP and CLIP-H, through a multiplicative gating mechanism that learns per-input stream weights conditioned on image content. Unlike static concatenation fusion, the proposed gating network suppresses or amplifies each stream’s contribution based on the input, producing weights that correlate positively (Spearman rank correlation rho=0.33) with the per-distortion NSS contribution measured by independent ablation on KADID-10k. The framework requires no end-to-end fine-tuning of the VLM backbones and is trained with a hybrid loss combining mean squared error, Pearson linear correlation, and pairwise ranking objectives. We evaluate on three standard benchmarks: KonIQ-10k (SROCC=0.9142, PLCC=0.9279), KADID-10k (SROCC=0.9715, PLCC=0.9733, surpassing recent state-of-the-art methods), and LIVE Challenge in-the-Wild (SROCC=0.8527, PLCC=0.8802 with cross-dataset pretraining and fine-tuning). A per-distortion analysis on KADID-10k reveals that NSS features contribute most on noise and color-shift distortions where pixel statistics are directly affected, and least on perceptual distortions such as color saturation changes. The learned gate values validate these findings, confirming that the model autonomously discovers distortion-stream affinity patterns consistent with the manual per-distortion study.
[CV-76] owards 3D-Aware Video Diffusion Models: Render-Free Human Motion Control with Mesh Tokenization
链接: https://arxiv.org/abs/2606.02000
作者: Jingyun Liang,Min Wei,Shikai Li,Yizeng Han,Hangjie Yuan,Lei Sun,Weihua Chen,Fan Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Project page: this https URL
Abstract:Diffusion models have shown remarkable success in video generation. However, whether such models are truly aware of the 3D structure underlying visual observations, rather than simply reproducing plausible 2D projections, remains an open question. In this work, we investigate this question through human motion control, a task that requires precise modelling of 3D human geometry, motion, camera viewpoint, and scene context. Unlike prior methods that rely on rendered 2D motion guidance videos, we propose a render-free framework that conditions video generation directly on compressed 3D human mesh tokens. This representation preserves full 3D geometric information while enabling a unified token-based generation pipeline that processes video tokens jointly with motion tokens in a DiT-based architecture. This design requires the model to reason jointly about appearance, 3D structure, and camera viewpoint during video generation. Experimental results demonstrate strong performance on human motion control benchmarks, while reducing artifacts induced by view-dependent 2D guidance and trajectory-pose mismatches during editing. These findings suggest that video diffusion models, when equipped with mesh tokenization, can better capture complex 3D human structures and their interactions with the surrounding environment.
[CV-77] A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision
链接: https://arxiv.org/abs/2606.01992
作者: Stefano Samele,Eugenio Lomurno,Teodora Jovanovic,Sanjay Shivakumar Manohar,Alberto Crivellaro,Matteo Matteucci
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model’s I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.
[CV-78] MT-EditFlow: Reinforcement Learning for Multi-Turn Image Editing with Flow Matching
链接: https://arxiv.org/abs/2606.01985
作者: Jiahui Huang,Yasi Zhang,Tianyu Chen,Shu Wang,Jianwen Xie,Oscar Leong,Mingyuan Zhou,Nanzhu Wang,Ying Nian Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent breakthroughs in instruction-based image editing have captured significant attention, as models are now capable of handling real-world editing demands with the practicality required by everyday users. However, editing models trained primarily for single-turn edits often break down in multi-turn editing–the natural interactive setting where a user iteratively refines an image based on the model’s own previous outputs. This failure stems from the all-or-nothing requirement, where a single failed turn compromises the entire sequence, and error propagation, where exposure bias leads to compounding editing errors. To address these challenges, we introduce MT-EditFlow, a flow-matching reinforcement learning framework designed to optimize reward signals for sequential image editing. MT-EditFlow integrates a multi-turn perspective with a multi-reward formulation to provide a unified structure applicable to both GRPO and NFT-based reinforcement learning methods. We systematically analyze and optimize the reward signal by investigating effective scoring strategies for turn-level aggregation, VLM reasoning modes to trade off reward bias and variance, and advantage fusion levels to prevent reward hacking. Our findings reveal that broadcasting the aggregated advantage across the entire editing trajectory effectively bridges the gap between local planning and global multi-turn task success. Extensive experiments demonstrate that MT-EditFlow significantly improves performance across diverse base models. Notably, it boosts FLUX.1-Kontext-dev by 6.85 points in turn-3 overall performance, surpassing state-of-the-art open-source models such as Qwen-Image-Edit. By maintaining high marginal success rates and reducing exposure bias, MT-EditFlow provides a foundation for more reliable and natural human-AI collaboration in visual content creation.
[CV-79] Generalization Limits in Vehicle Re-Identification
链接: https://arxiv.org/abs/2606.01981
作者: Anis Yassine Ben Mabrouk(CB),Antoine Tadros(CB),Rafael Grompone von Gioi(CB),Gabriele Facciolo(CMLA, LIGM),Axel Davy(CB),Rodrigo Verschae
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vehicle re-identification focuses on retrieving images of the same vehicle from a gallery given a query image. Upon closer inspection of commonly used datasets, we observe that vehicles with few visual differences-e.g., the same make, model, and color-appear in both the training and test sets. As a result, methods that effectively memorize the training data tend to perform well on these test sets but struggle to generalize to other datasets. In this paper, we address this issue by proposing a novel evaluation approach that more effectively measures generalization capability to unseen vehicle types. To further study generalization performance, we also propose splitting the evaluation based on view, allowing us to differentiate the effect of viewpoint robustness from that of same-view re-identification. Our findings reveal that most state-of-the-art methods struggle with unseen vehicle types, and that their robustness to viewpoint changes and attention to detail are limited to vehicle types seen during training.
[CV-80] A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation
链接: https://arxiv.org/abs/2606.01973
作者: Zefeng Li,Evan Shelhamer
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: TMLR 2026
Abstract:Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like ‘‘garlic bread’’ vs. ‘‘hot dog’’ for food, or ‘‘highway’’ vs. ‘‘dam’’ for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like ‘‘cracked’’ mud, ‘‘porous’’ sponge, ‘‘veined’’ leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method’s own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.
[CV-81] Contrastive Augmented Transformer with Domain-specific Enhancement for Robust Multi-scenario Metal Surface Defect Detection
链接: https://arxiv.org/abs/2606.01962
作者: Yiyao Liua,Wenxiao He,Liyuan Ren,Huan Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Metal surface defect detection is critical for maintaining product quality in industrial manufacturing. However, it faces significant challenges, including limited annotated data, difficulty in identifying subtle multi-scale defects, and poor generalization across diverse scenarios. To address these issues, this paper proposes a novel Contrastive Augmented Transformer (CAT) framework for robust defect detection. CAT employs a hierarchical Swin Transformer backbone and redesigns the feature pyramid network to effectively fuse low-level textures with high-level semantics, enabling precise modeling of subtle and multi-scale defect patterns. To enhance robustness under real-world noise conditions, we propose a domain-specific droplet augmentation algorithm. Furthermore, we incorporate a hard negative mining strategy into the contrastive loss to strengthen the model’s discrimination ability in ambiguous defect regions. Experimental results on the KolektorSDD2 dataset demonstrate that CAT achieves a pixel-level AUROC of 99.54%, outperforming existing methods. In addition, CAT exhibits superior generalization and robustness on three unseen datasets, including KSDD1, MTD for tile defects, and MSDD for rail surface defects, demonstrating its potential for wide-scale industrial deployment.
[CV-82] WALL-WM: Carving World Action Modeling at the Event Joints
链接: https://arxiv.org/abs/2606.01955
作者: Shalfun Li,Victor Yao,Charles Yang,Truth Qu,Regis Cheng,Ryan Yu,Howard Lu,Newton Von,Vincent Chen,Yohann Tang,Maeve Zhang,Ellie Ma,Gody Li,Sage Yang,Lorien Shu,J.W. Gao,Ethan Chen,Colin Ye,Yu Sun,Elise Mon,PS Zhang,Neo Li,Lily Li,James Wang,Ping Yang,Chris Pan,Lucy Liang,Hang Su,Roy Gan,Hao Wang,Qian Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.
[CV-83] Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects
链接: https://arxiv.org/abs/2606.01950
作者: Jens U. Kreber,Lukas Mack,Joerg Stueckler
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.
[CV-84] Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks
链接: https://arxiv.org/abs/2606.01947
作者: Nermeen Abou Baker,David Rohrschneider,Uwe Handmann
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Published by the Machine Learning and Knowledge Extraction Journal
Abstract:Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention–explored here for the first time–achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA, exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.
[CV-85] Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization
链接: https://arxiv.org/abs/2606.01945
作者: Yumiao Zhao,Bo Jiang,Beibei Wang,Xixi Wan,Xiao Wang,Jin Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. As we know that spiking neuron can perform inexpensive information processing by transmitting the input data into discrete spike trains and return sparse outputs. Inspired by this, we propose \textbfLow-\textbfRank visual \textbfSpike \textbfPrompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a Spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate pixel-level sparse prompt for each instance. To be specific, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, our LoRSP generates a \emphsparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters compared to existing VP methods.
[CV-86] SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation
链接: https://arxiv.org/abs/2606.01940
作者: Can Zhang,Gim Hee Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.
[CV-87] SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video ICRA2026
链接: https://arxiv.org/abs/2606.01939
作者: Howard Huang,Bharath Surianarayanan,Keifer Lee,Chenyu Wang,Chen Feng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE ICRA 2026
Abstract:Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55,m by 7,m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8,cm with respect to ground-truth.
[CV-88] Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning
链接: https://arxiv.org/abs/2606.01935
作者: Ziyang Yao,Zeyu Zhu,YunCheng Jiang,Zibin Guo,Huijing Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.
[CV-89] 3rd Place at CVPR 2026 CASTLE Challenge: Agent ic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval
链接: https://arxiv.org/abs/2606.01933
作者: Raghad Albusayes,Munirah Alyahya
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through a hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at this https URL.
[CV-90] Pool-Select-Refine: Allocation-Aware Generative Dataset Distillation with Soft-Label-Guided Latent Refinement
链接: https://arxiv.org/abs/2606.01920
作者: Wenmin Li,Shunsuke Sakai,Zhongkai Zhao,Tatsuhito Hasegawa
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion-based dataset distillation has recently emerged as a promising paradigm for condensing large-scale datasets into compact synthetic sets. By leveraging pretrained generative priors, these methods can produce realistic class-conditional samples more efficiently than traditional matching-based approaches. However, most existing diffusion-based methods still adopt a rigid Generate-and-Use'' strategy, where the generated samples are directly treated as the final distilled set under a fixed images-per-class budget. Such a design tightly couples candidate generation with final budget allocation, which may result in redundant waste of the limited budget or insufficiently informative samples. In this paper, we propose Pool-Select-Refine’', a two-stage framework for allocation-aware generative dataset distillation. First, instead of directly using a fixed number of generated samples, we construct an over-complete candidate pool and select a compact subset under the target budget. Second, we refine the selected samples in latent space using soft-label supervision derived from the teacher model, improving semantic alignment while preserving the generative prior. This design explicitly decouples generation, selection, and refinement, enabling more effective use of the distillation budget. Experiments on large-scale and fine-grained image classification benchmarks show that the proposed framework delivers consistent gains over diffusion-based baselines. The results suggest that introducing a curation stage before refinement is a simple yet effective way to improve diffusion-based dataset distillation.
[CV-91] Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering CVPR2026
链接: https://arxiv.org/abs/2606.01911
作者: Dongxing Mao,Jinpeng Wang,Jiahao Tang,Kevin Qinghong Lin,Linjie Li,Zhengyuan Yang,Lijuan Wang,Min Li,Jingru Tan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 poster
Abstract:Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering with blur strokes and disrupt letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter(RDA) that upgrades an existing tokenizer post-hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch to learn the tiny differences (residual) between the reconstructed image and the ground-truth images in the pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA substantially improves text rendering significantly by a large margin. For instance, we boost finetuned Janus-Pro OCR accuracy rises from 24.52% to 58.26% (TextVisionBlend), from 12.75% to 36.81% (StyledTextSynth) on competitive TextAtlas benchmark. The code is available at this https URL
[CV-92] Single-Line Drawing Generation via Semantics-Driven Optimization
链接: https://arxiv.org/abs/2606.01910
作者: Tanguy Magne,Alexandre Binninger,Ruben Wiersma,Olga Sorkine-Hornung
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, published in Computer Graphics Forum 2026
Abstract:Line drawings are a highly expressive art form that requires the artist to abstract and distill the essence of their subject. We present the first semantics-driven method for automatically generating single-line drawings in vector format, guided either by a text prompt describing the concept or an input image depicting it. Our approach leverages score distillation sampling to optimize the parameters of a uniform rational B-spline (URBS) curve, ensuring that the drawing consists of a single continuous stroke by design. This representation provides fine-grained control over the level of detail, while additional loss terms allow us to steer the final artistic style. We demonstrate that our method outperforms state-of-the-art text-to-image models and optimization pipelines for this task, producing results that are both more aesthetically pleasing and more faithful to the style of continuous line drawing artists. Furthermore, because our method generates a vectorized curve, it directly supports downstream fabrication processes such as embroidery, laser engraving and wire bending. Our code and results are available at this https URL.
[CV-93] Private and Stable Test-Time Adaptation with Differential Privacy ICML2026
链接: https://arxiv.org/abs/2606.01908
作者: Zefeng Li,Qiaoyue Tang,Mathias Lecuyer,Evan Shelhamer
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
[CV-94] Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation
链接: https://arxiv.org/abs/2606.01900
作者: Muhammed Burak Kizil,Enes Sanli,Niloy J. Mitra,Xuelin Chen,Erkut Erdem,Aykut Erdem,Duygu Ceylan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods
[CV-95] rain Test Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
链接: https://arxiv.org/abs/2606.01896
作者: Atmika Bhardwaj,Silvia Vock,Nico Steckhan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 4 figures
Abstract:Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment: train on real U synthetic data, then fine-tune the resulting weights on real-only at a lower learning rate, increases mAP@0.5 compared to the real-only baseline model on the standard real test set, and improves the real-gloves out-of-distribution gap. Another three-stage experiment preserves box-tightness best, reaching the highest mAP@0.5:0.95 of any other experiment in the study. The synthetic-data utility for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.
[CV-96] Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations
链接: https://arxiv.org/abs/2606.01895
作者: Xingyu Qu,Wenxuan Zhang,Peng Hu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, in model YOLOv9-m, single-view compared to a three-view fused RGB setting, mAP50 increases from 0.638 to 0.732, while mAP50-95 improves from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.
[CV-97] Adversarial Attacks on Robot Localization Systems via Deep Feature Perturbation
链接: https://arxiv.org/abs/2606.01892
作者: Zhenyu Li,Tianyi Shang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11page
Abstract:Robot localization systems are critical for autonomous navigation and safety. Adversarial perturbations can mislead these systems, resulting in mislocalization, navigation errors, or unsafe interactions, especially in mission-critical scenarios. This paper investigates the vulnerability of deep learning based localization pipelines to adversarial attacks. We propose a novel framework for generating adversarial queries that specifically target Product Quantization (PQ) in visual localization systems. Our method employs a Lightweight Product Quantization Network (LPQN) to perturb query feature encodings, misleading the retrieval process by returning semantically irrelevant database entries. Adversarial queries are generated via a two-phase procedure: a forward pass that perturbs feature distributions and a backward pass that refines the perturbation through optimization. The lightweight design of LPQN allows the creation of subtle yet highly effective perturbations with minimal computational overhead. Extensive experiments in both controlled and real-world robotic environments demonstrate that our approach substantially degrades PQN performance, exposing critical vulnerabilities in practical applications.
[CV-98] Divide and Conquer: Reliable Multi-View Evidential Learning for Deepfake Detection ICML2026
链接: https://arxiv.org/abs/2606.01885
作者: Xiaolu Kang,Zhongyuan Wang,Jikang Cheng,Baojin Huang,Zhanhe Lei,Gang Wu,Qin Zou,Qian Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:With the evolution of generative models, deepfakes have achieved near-perfect semantic realism, leaving forensic traces only in subtle structural anomalies. However, existing single-view paradigms often fail to generalize, as dominant semantic features overwhelm subtle artifact cues within entangled representations. This imbalance leads to overconfident yet brittle predictions – a phenomenon we term the Semantic Masking Effect. To address this challenge, we propose a reliable framework called Divide-and-Conquer Multi-View Evidential Learning (DiCoME) for Deepfake Detection. In the “Divide” phase, we employ Geometric View Purification to decompose the entangled representation space through principled geometric projection. This process suppresses semantic interference within artifact-sensitive representations, forming the foundation for decorrelated yet complementary semantic and artifact views. In the “Conquer” phase, we leverage Uncertainty-Aware Evidential Learning to synthesize these distinct views. By explicitly modeling the “epistemic conflict” between semantic and artifact cues, this mechanism provides calibrated uncertainty estimates instead of forcing rigid deterministic decisions. Extensive experiments across multiple benchmarks demonstrate that our method consistently outperforms existing approaches in generalization performance, while providing reliable uncertainty estimation for trustworthy deepfake detection. Code is available at this https URL.
[CV-99] Beyond the Simplex: Balanced Prototype Geometry for Scorer-Agnostic Open-Set Recognition
链接: https://arxiv.org/abs/2606.01883
作者: Mayank Sharma,Rohit Kumar Mourya
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 2 figures, 6 tables
Abstract:Open-set recognition (OSR) requires a classifier to reject inputs from unseen classes which is essential in safety-critical settings such as medical imaging. Simplex based methods, which fix class prototypes at the vertices of a regular simplex and then reject via a distance-ratio score, perform well empirically but lack theoretical justification, and existing analysis applies only when the embedding dimension d is at least C-1, which is the regime in which a regular simplex exists. We give a theoretical account of simplex-ratio OSR that holds in every embedding dimension, including d C-1. Our analysis centers on balanced equal-norm codes: prototype configurations with equal lengths and zero sum, which exist for all d = 2 and include the regular simplex as a special case. For these codes we show that an auxiliary squared ratio score has sublevel sets that are exact unions of Euclidean balls, which in turn bracket the acceptance region of the operational score; and we prove a sharp dichotomy: the prototypes attain one-distance symmetry, behaving like a regular simplex, if and only if d = C-1, with controlled degradation governed by an explicit defect parameter below that threshold. We further show the false-acceptance rate decays exponentially in d under natural isotropy assumptions, and that the operational score is globally Lipschitz with compact acceptance regions. Empirically, we study balanced prototype geometry as both an analytic tool and a representation-learning prior, rather than as a stand-alone state-of-the-art detector. Across CIFAR and MedMNIST open-set splits, the geometry provides useful structure, but OSR performance remains strongly dependent on the scoring rule: raw ratio scores typically underperform nearest-neighbor and logit-based alternatives. Comments: 20 pages, 2 figures, 6 tables Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.01883 [cs.LG] (or arXiv:2606.01883v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.01883 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-100] Deep Learning for Generating Computational PIN-4 Immunohistochemistry Staining from Prostate Biopsy HE Images
链接: https://arxiv.org/abs/2606.01871
作者: Vietbao Tran,Pratik Shah
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Immunohistochemistry (IHC)is frequently used to resolve diagnostically ambiguous prostate cancer biopsy findings on hematoxylin and eosin (HE)-stained tissue. However, PIN-4 IHC staining is typically performed on adjacent tissue sections, limiting direct spatial comparison between the HE morphology and the corresponding immunophenotypic signal. A paired, registered HE/PIN-4 dataset was constructed from routine clinical prostate biopsy whole-slide images (WSIs), and a conditional generative adversarial network (cGAN) was trained to synthesize PIN-4 staining patterns directly from native HE image patches. The final dataset comprised 172 paired WSIs from 93 patients and 27,298 registered 1024x1024 patch pairs, spanning adenocarcinoma-positive and benign cases with representation across age, race, and ethnicity groups. The model was evaluated on a held-out test set of 1,814 patch pairs from 17 WSIs, achieving a mean peak signal-to-noise ratio (PSNR) of 21.88 dB, structural similarity index measure (SSIM) of 0.667, Pearson correlation coefficient (PCC) of 0.684, and learned perceptual image patch similarity (LPIPS) of 0.417. Qualitative review by a board-certified pathologist showed that generated images captured diagnostically relevant PIN-4 staining patterns, including AMACR/racemase expression and basal-cell-associated staining, while preserving spatial correspondence with the source HE morphology. Accuracy of synthesis varied across morphologically complex regions, including high-grade carcinoma and intraductal carcinoma. These results support the feasibility of supervised PIN-4 synthesis from routinely acquired brightfield HE prostate biopsy images. The approach enables direct interpretation of predicted PIN-4 marker patterns in the context of the source prostate HE architecture, addressing a current spatial limitation of conventional adjacent-section IHC.
[CV-101] Polaris: Scaling Up Instruction-Guided Image Generation Towards Millions of Personalized Style Needs
链接: https://arxiv.org/abs/2606.01858
作者: Zhi-Kai Chen,Jun-Peng Jiang,Jun-Jie Tao,De-Chuan Zhan,Han-Jia Ye
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Users increasingly expect image generation models to quickly adapt to highly diverse and personalized requirements, such as producing images with distinctive styles or characteristics. Traditional approaches rely on fine-tuning, which is costly and difficult to scale. To cope with these limitations, the community has accumulated a growing library of fine-tuned modules and adapters, where each component targets specific generation needs and collectively serves as a foundation for handling new demands. This naturally raises a question: instead of repeatedly training new models, can we systematically exploit this expanding ecosystem to better fulfill user instructions? To this end, we present Polaris, an intelligent retrieval framework that automatically selects and integrates suitable models from the model library based on a user’s instructions. The key insight is that harnessing such a massive and heterogeneous pool requires not only finding the most relevant modules among thousands of candidates, but also aligning them effectively for instruction-driven generation and editing. Polaris addresses this challenge by indexing over 6,500 checkpoints and 75,000 adapters, and retrieving the most relevant components given a user’s input and instruction. In doing so, it delivers scalable, controllable, and well-aligned generation – without any additional training.
[CV-102] RescueBench: Can Embodied Agents Save Lives in the Wild ?
链接: https://arxiv.org/abs/2606.01848
作者: Kui Wu,Beiyu Guo,Hao Chen,ShuHang Xu,Yuling Li,Yongdan Zeng,Zhoujun Li,Yizhou Wang,Fangwei Zhong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Search-and-rescue (SAR) requires embodied agents to explore unfamiliar environments under multimodal uncertainty, perform multi-stage interactions, and retrieve spatial memory over long horizons. Existing benchmarks typically evaluate these capabilities in isolation, leaving unclear how failures compound when they must be composed in realistic workflows. We introduce RescueBench, a photo-realistic diagnostic benchmark that instantiates SAR as a four-stage pipeline: multimodal exploration, target rescue, memory-guided return, and final handoff. By combining sequential task composition with stage-level evaluation, RescueBench enables analysis of how exploration and memory failures propagate through embodied rescue workflows. It contains five progressive difficulty levels that vary in environmental complexity, clue ambiguity, and spatial hierarchy, along with an automatic episode generation and annotation pipeline for scalable evaluation and training. We evaluate seven baselines, an oracle reference, and human players, showing that no baselines complete the full task at the greatest difficulty. Stage-level diagnosis identifies autonomous exploration as the dominant failure mode and spatial memory as a second, independent bottleneck, suggesting that these limitations are not resolved by current topological visual-language navigation or map-based methods. Code is available in this https URL
[CV-103] Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection
链接: https://arxiv.org/abs/2606.01843
作者: Yihui Wang,Yonghui Yang,Jilong Liu,Fengbin Zhu,Le Wu,Tat-Seng Chua
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and suppress such shortcuts in learned representations. In this work, we propose Shortcut Subspace Suppression (S^3) framework that explicitly characterizes and suppresses method-specific shortcuts via subspace modeling. Our key insight is that variations distinguishing different forgery methods capture method-specific artifacts and thus serve as an effective proxy for method-specific shortcuts. To this end, we train a lightweight linear probe for forgery method classification and perform Singular Value Decomposition (SVD) to extract the dominant shortcut subspace. Building on this formulation, we develop two complementary strategies to reduce shortcut reliance. During training, we softly suppress the shortcut subspace in feature representations, encouraging the model to rely on more generalizable cues for real/fake discrimination. At inference time, we introduce a training-free counterpart that attenuates neurons aligned with the identified shortcut directions, enabling plug-and-play generalization enhancement with improved interpretability. Extensive experiments on multiple benchmarks demonstrate that our method significantly improves cross-method generalization while maintaining strong in-domain performance. The code will be released upon acceptance of the submission.
[CV-104] Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition
链接: https://arxiv.org/abs/2606.01834
作者: Chinthaka Ranasingha,Tharindu Fernando,Sridha Sridharan,Clinton Fookes,Harshala Gammulle
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Human Action Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture motion dynamics from CSI measurements, thereby increasing model complexity and reducing efficiency. Instead, we argue that incorporating appropriate inductive biases tailored to the physical characteristics of CSI signals enables more efficient and effective learning. In this work, we propose a compact temporal convolutional network (TCN)-based framework that explicitly incorporates motion-aware inductive biases into feature learning. Specifically, we introduce a Doppler-energy-guided temporal attention mechanism in feature space to emphasize motion-salient time segments, and a variance-driven channel attention module to weight informative subcarriers based on temporal motion statistics adaptively. By integrating these domain-specific priors, the proposed model effectively captures motion dynamics without increasing architectural depth. Extensive experiments on multiple benchmark datasets demonstrate that our approach achieves superior performance compared to deeper baselines, while significantly reducing parameter count and computational cost.
[CV-105] ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search
链接: https://arxiv.org/abs/2606.01825
作者: Zequn Xie,Xibei Jia,Sihang Cai,Shulei Wang,Tao Jin
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 12 pages, 5 figures
Abstract:Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.
[CV-106] Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios
链接: https://arxiv.org/abs/2606.01822
作者: Mingxiao Wang,Xiaozhen Qu,Bolin Gao,Tong Wang,Lei He
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 figures, 3 tables
Abstract:Traffic sign detection is a fundamental component of environmental perception in autonomous driving and intelligent transportation systems. However, most existing detectors rely on static inference with globally shared parameters, limiting their ability to adapt to diverse and unstructured traffic scenarios. As a result, a single static model often struggles to simultaneously handle both clear near-range samples and challenging conditions such as distant small targets or adverse weather environments. To address this limitation, we propose CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts(MoE) framework for traffic sign recognition. The proposed framework departs from the conventional globally shared parameter paradigm by introducing a heterogeneous You Only Look Once (YOLO) expert pool together with a lightweight gating network, enabling an image-level dynamic routing mechanism. Based on the semantic characteristics of the input image, the gating module selectively activates the most suitable expert model from the expert pool, enabling a shift from fixed parameter fitting to on-demand dynamic representation. This design enhances feature extraction capability for specific scenarios while maintaining controlled inference overhead. Experimental results demonstrate that the proposed method achieves a remarkable balance between detection accuracy and efficiency on the composite traffic sign dataset. Specifically, our method attains an mAP50-95 of 76.8%, yielding a 2.3% improvement over the baseline method (74.5%) while simultaneously reducing computational overhead by approximately 39.4%. These findings robustly validate the effectiveness of the proposed approach.
[CV-107] Hist2Style: Histogram-Guided Stylization with Bilateral Grids WWW
链接: https://arxiv.org/abs/2606.01819
作者: Dekel Galor,Adam Pikielny,Zhoutong Zhang,Ke Wang,Laura Waller,Jiawen Chen,Ilya Chugunov
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 10 pages, 8 figures. Extended results are at this https URL
Abstract:Photorealistic style transfer aims to match the color and tone of an input image to that of a style target while preserving the content and details of the original scene. Although existing large image models can facilitate these kinds of appearance edits, their high computational demands, potential for hallucinations, and limited user control make them unsuitable for high-resolution, real-time workflows. We introduce Hist2Style, a bilateral-grid formulation for fast, edge-aware stylization that preserves visual fidelity by constraining operations to locally affine transforms in bilateral space. Our model distills a large image editing model into a lightweight network by training on a large supervised corpus generated with language and vision-language models, targeting spatially varying color edits. The network conditions on a histogram-based embedding of the style target to provide an interpretable interface for adjusting the output style by modifying the target color distribution. Overall, Hist2Style maintains content structure by construction, avoids hallucinations, and supports real-time, high-resolution photorealistic stylization with interactive user-controllable color and tone adjustments.
[CV-108] Unsupervised Collaborative Domain Adaptation for Driving Scene Parsing
链接: https://arxiv.org/abs/2606.01818
作者: Jiahe Fan,Shaolong Shu,Mingjian Sun,Tiehua Zhang,Bohong Xiao,Hanli Wang,Rui Fan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable driving scene parsing is a fundamental capability for autonomous vehicles operating in open and dynamic driving environments. However, adapting perception models to new deployment domains remains challenging because pixel-level annotations are expensive to obtain, while source-domain data are often inaccessible due to privacy, security, or ownership constraints. Existing source-free unsupervised domain adaptation methods typically rely on a single pre-trained source model, which makes the adapted perception system vulnerable to source-specific biases and limits its robustness under diverse road layouts, illumination conditions, weather patterns, and traffic conditions. This article presents an unsupervised collaborative domain adaptation (UCDA) framework for driving scene parsing in a source-free setting, which transfers complementary knowledge from multiple pre-trained source models to a unified target model without accessing any original source samples. To compare predictions from independently trained models, UCDA constructs a class-level prototype memory bank and estimates cross-model prediction reliability through prototype similarity, reducing the effect of inconsistent confidence scales across source models. Based on the resulting complementary supervision, UCDA adopts a two-stage transfer strategy: multiple source models are first refined on unlabeled target-domain driving data through collaborative optimization with positive and negative consistency constraints, and their validated expertise is then distilled into a single deployable target model. Comprehensive evaluations on public driving-scene datasets and real-world data collected from an autonomous vehicle platform demonstrate that UCDA effectively consolidates complementary multi-source knowledge, improving target-domain scene parsing reliability and generalization across diverse driving environments.
[CV-109] Personalized 3D Myocardial Infarct Geometry Reconstruction from Cine MRI for Cardiac Digital Twins
链接: https://arxiv.org/abs/2606.01808
作者: Yilin Lyu,Mark YY Chan,Ching-Hui Sia,Lei Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages
Abstract:Accurate 3D geometric characterization of myocardial infarction (MI) is essential for building cardiac digital twins (CDTs) to precisely simulate infarct-related electrophysiology. Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is the clinical reference for locating MI, yet its reliance on contrast agents restricts use in renally impaired patients and limits longitudinal follow-ups. As an alternative, contrast-free cine MRI visualizes abnormal ventricular wall motion, which is highly indicative of the infarcted area. In this study, we propose a novel explicit geometry-motion embedded model to fully automatically reconstruct personalized, simulation-ready 3D MI geometries directly from multi-view cine MRIs. Specifically, we construct a 4D (3D + t) biventricular mesh to explicitly extract and decouple geometry-aware and motion-aware features. We further design a dual-branch module for adaptive geometry-motion fusion to capture spatiotemporal dependencies for mapping infarcted region. Furthermore, we introduce multi-scale supervision utilizing an AHA-17 segment-guided cross-attention mechanism to steer the prediction, ensuring biophysically consistent reconstruction. Experimental results on 225 cine MRIs demonstrated that the proposed 3D MI reconstruction achieved high performance with an average Dice score of 0.678 \pm 0.011. In the downstream in-silico electrophysiological simulation evaluations, the results were highly consistent with the LGE-derived ground truth, highlighting the great potential of the proposed model for contrast-free scar characterization and seamless integration into CDT modeling. The code will be released publicly upon acceptance of the manuscript for publication.
[CV-110] STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
链接: https://arxiv.org/abs/2606.01790
作者: Yuhang Han,Wenzheng Yang,Yujie Chen,Xiangqi Jin,Yaojie Zhang,Siteng Huang,Linfeng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language-model-based graphical user interface (GUI) agents have shown broad automation capabilities, yet deployment is bottlenecked by a key-value (KV) cache that grows linearly with interaction steps. For instance, UI-TARS-1.5-7B consumes 76 GB of GPU memory on merely five screenshots, approaching the capacity of mainstream 80 GB accelerators. Existing KV compression methods share two structural assumptions: aggregating visual-token importance into a single shared saliency map, and applying a fixed top-B cutoff to the fused score distribution. Pilot measurements refute both: spatial specialization lives at the attention-subspace level and migrates across layers, while the score distribution drifts in shape along a trajectory. We propose STaR-KV (Spatio-Temporal Adaptive Re-weighting), a training-free KV cache compression framework that calibrates token importance along three axes: (i) subspace-aware scoring driven by online spatial mutual information; (ii) a temporal stability discount that suppresses redundant cache entries from persistently attended subspaces; and (iii) an entropy-derived temperature that adaptively reshapes the score distribution. Across four GUI benchmarks, STaR-KV achieves the strongest average accuracy among state-of-the-art KV compression methods (e.g., GUIKV, SnapKV) at matched budgets, with no compression-stage FLOPs overhead (-0.07%) and cutting peak GPU memory by nearly 40% at a 20% KV-cache budget. Code is available at this https URL.
[CV-111] PlatonicNav: Unveiling Semantic Correspondence in Navigation with Platonic Topological Maps
链接: https://arxiv.org/abs/2606.01788
作者: Junlin Long,Zeyu Zhang,Xu Deng,Yiran Wang,Yue Yang,Luke Borgnolo,Maxwell Twelftree,Yang Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Embodied visual navigation, where an agent perceives a complex environment and acts to reach a goal from raw sensory input, underpins a wide range of applications such as household service robotics, assistive robotics, and large-scale autonomous exploration. However, recent attempts to unify vision-and-language navigation (VLN) and object goal navigation (ObjNav) remain at the level of architectural fusion, mixed-task training, and large vision-language pretraining, without examining whether independently trained vision and language encoders may already share a common semantic structure. Moreover, even object-centric topological maps still ground language goals through explicit cross-modal supervision such as CLIP or large vision-language models, leaving open whether such grounding is possible from a purely vision-built map. To address these challenges, we extend the Platonic Representation Hypothesis to embodied navigation and recast vision-only ObjNav, cross-modal ObjNav, and VLN as three different interfaces to the same object-centric semantic manifold. We further introduce PlatonicNav, a training-free framework whose Platonic Topological Map fuses geometric and semantic node distances from a self-supervised visual encoder, and grounds language goals via blind matching without any paired vision-language data. Extensive experiments on simulation benchmarks including HM3D-IIN, OVON, and R2R-CE on MP3D, together with deployment on Unitree Go2, demonstrate that PlatonicNav generalizes across tasks, modalities, and embodiments without explicit cross-modal training. Code: this https URL. Website: this https URL.
[CV-112] PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection
链接: https://arxiv.org/abs/2606.01757
作者: Smit Kadvani,Shriya Gumber,Kriti Faujdar,Harsh Dave
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 1 figures, 8 tables
Abstract:Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.
[CV-113] EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models
链接: https://arxiv.org/abs/2606.01756
作者: Hongyu Lu,Feng Zhang,Wenwei Jin,Huanling Hu,Pengfei Zhang,Yao Hu,Jiawei Li,Shikai Jiang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint. 12 pages, 6 figures, 7 tables
Abstract:Large vision-language models (LVLMs) achieve strong performance on image and video understanding tasks, but their inference efficiency is constrained by the large number of visual tokens produced by vision encoders. Most existing visual token compression methods estimate token importance from attention scores or representation properties at specific layers, overlooking how visual tokens evolve across the vision encoder. Such layer-specific criteria may provide incomplete importance estimates and limit performance preservation after compression. To address this issue, we analyze layer-wise visual token evolution directions and observe that tokens form multiple group evolution directions across vision-encoder layers. Our analysis further shows that informative tokens tend to exhibit persistent deviations from common group evolution directions. Based on this observation, we propose EvoCut, a training-free and attention-free visual token compression method that estimates token importance from multi-layer evolution deviation. Experimental results show that EvoCut can retain only 11.1% of the visual tokens on LLaVA-1.5-7B while preserving 94.4% of the average performance, demonstrating its effectiveness in balancing efficiency and accuracy.
[CV-114] Quality-Guided Semi-Supervised Learning for Medical Image Segmentation MICCAI2026
链接: https://arxiv.org/abs/2606.01753
作者: Kumar Abhishek,Ghassan Hamarneh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Early Accept at MICCAI 2026, 13 pages, 2 figures
Abstract:Training accurate medical image segmentation models requires large amounts of densely annotated data, which is costly and time-consuming to obtain. Semi-supervised learning (SSL) alleviates this by learning from both abundant unlabeled data and limited labeled data. However, most modern SSL methods rely on pseudolabels for unlabeled data, and typically assess their reliability through model confidence or uncertainty, measures that are self-referential and lack explicit grounding in segmentation quality. Instead, we propose a quality-guided SSL framework that trains a dedicated network to estimate segmentation quality from image-mask pairs. The predictor is trained on variable-quality masks generated through synthetic corruptions augmented with imperfect outputs from partially trained segmentation models, capturing realistic error patterns encountered during training. We integrate the quality predictor into SSL through two complementary mechanisms: a quality-aware regularization loss and a quality-based pseudolabel sample reweighting scheme. We show that our method serves as a drop-in enhancement to existing SSL frameworks. Extensive experiments across five datasets and multiple architectures demonstrate consistent improvements over competing SSL methods, advancing the state-of-the-art in semi-supervised medical image segmentation.
[CV-115] Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness
链接: https://arxiv.org/abs/2606.01746
作者: Kai Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages including reference, 4 figures
Abstract:Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple \ell_2 distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers’ high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, \ell_2 -classifiers’ insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel \ell_2 -reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of \ell_2 distance. It yields \ell_2 -distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via EMA, and (2) dynamic, batch-level prototypes generated from the FC classifier’s predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.
[CV-116] FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds
链接: https://arxiv.org/abs/2606.01734
作者: Rai Hisada,Kanji Tanaka
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
备注: 5 pages, 1 figure, technical report
Abstract:This paper proposes ``FlatVPR,‘’ a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors \mathbfz_A and \mathbfz_B can be accurately reconstructed via linear interpolation \hat\mathbfz_pseudo = (1-t)\mathbfz_A + t\mathbfz_B , where t \in [0,1] denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation \hat\mathbfz = \mathbfz + \textRes(\mathbfz) to the raw foundation features \mathbfz , where \textRes(\cdot) represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100m intervals and extreme seasonal changes.
[CV-117] Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference ICML2026
链接: https://arxiv.org/abs/2606.01711
作者: Hyeonwoo Cho,DongHyeon Baek,Yewon Kim,Bumsub Ham
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.
[CV-118] Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs ICML2026
链接: https://arxiv.org/abs/2606.01710
作者: Afsaneh Hasanebrahimi,Hanxun Huang,Christopher Leckie,Sarah Erfani
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: ICML 2026
Abstract:Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.
[CV-119] JenBridge: Adaptive Long-Form Video Soundtracking across Scene Transitions
链接: https://arxiv.org/abs/2606.01703
作者: Jiashuo Yu,Yao Yao,Boyu Chen,Alex Wang
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.
[CV-120] Spatio-Temporal Correlation Guided Geometric Partitioning for Versatile Video Coding
链接: https://arxiv.org/abs/2606.01701
作者: Xuewei Meng,Chuanmin Jia,Xinfeng Zhang,Shanshe Wang,Siwei Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Geometric partitioning has attracted increasing attention by its remarkable motion field description capability in the hybrid video coding framework. However, the existing geometric partitioning (GEO) scheme in Versatile Video Coding (VVC) causes a non-negligible burden for signaling the side information. Consequently, the coding efficiency is limited. In view of this, we propose a spatio-temporal correlation guided geometric partitioning (STGEO) scheme to efficiently describe the object information in the motion field of video coding. The proposed method can economize the bits consumed for side information signaling, including the partitioning mode and motion information. We firstly analyze the characteristics of partitioning mode decision and motion vector selection in a statistically-sound way. Based on the observed spatio-temporal correlation, we design a mode prediction and coding method to reduce the overhead for representing the above mentioned side information. The main idea is to predict the STGEO modes and motion candidates that have higher selection possibilities, which can guide the entropy coding, i.e., representing the predicted high-probability modes and motion candidates with fewer bits. In particular, the high-probability STGEO modes are predicted based on the edge information and history modes of adjacent STGEO-coded blocks. The corresponding motion information is represented by the index in a merge candidate list, which is adaptively inferred based on the off-line trained merge candidate selection probability. Simulation results show that the proposed approach achieves 0.95% and 1.98% bit-rate savings on average compared to VTM-8.0 without GEO for Random Access and Low-Delay B configurations, respectively.
[CV-121] MixerSENet: A Lightweight Framework for Efficient Hyperspectral Image Classification
链接: https://arxiv.org/abs/2606.01700
作者: Mohammed Q. Alkhatib,Swalpa Kumar Roy,Ali Jamali
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)
Abstract:In this paper, a novel framework, MixerSENet, is introduced for hyperspectral image (HSI) classification, designed to address the challenges of computational efficiency and limited labeled data. The proposed model processes hyperspectral image patches while maintaining consistent size and resolution throughout the network, effectively decoupling the mixing of spatial and channel dimensions. Notably, MixerSENet is lightweight and computationally efficient, requiring fewer parameters compared to traditional models, making it suitable for resource-constrained environments. A squeeze and excitation block is incorporated into the model to refine feature extraction, enhancing the network’s ability to capture more informative features. Experimental results on two benchmark datasets demonstrate that MixerSENet achieves superior performance, reaching an overall accuracy (OA) of 82.47% on Houston13 dataset and 96.70% on the Qingyun dataset, outperforming state-of-the-art methods including 3D-CNN, HybridKAN, HSIFormer, SimPoolFormer, and MorphMamba. Furthermore, a detailed analysis of computational efficiency shows that MixerSENet achieves a favorable balance between accuracy and efficiency, with only 53,146 parameters and an low inference time, confirming its practicality for real-world applications. At publication, source code will be publicly available at this https URL.
[CV-122] Learning Label-Efficient Interpretable Medical Image Diagnosis via Semi-supervised Hypergraph Concept Bottleneck Model
链接: https://arxiv.org/abs/2606.01698
作者: Yijun Yang,Ruiqiang Xiao,Lijie Hu,Angelica I Aviles-Rivero,Yunzhu Wu,Jing Qin,Lei Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning has revolutionized medical image analysis, delivering exceptional diagnostic accuracy across diverse applications. Yet, the lack of interpretability in its decision-making hinders clinical adoption, particularly in high-stakes medical contexts where transparency is paramount for trustworthiness. For example, in Placenta Accreta Spectrum (PAS), subtle cues in ultrasound imaging challenge reliable diagnosis, rendering black-box models untrustworthy for accurate scoring. To address this, Concept Bottleneck Models (CBMs) offer a promising avenue by embedding clinically meaningful intermediate concepts into the diagnosis pipeline, enabling clinicians to scrutinize and refine model outputs. However, conventional CBMs falter in capturing complex inter-concept dependencies and demand costly, expert-driven concept annotations, limiting their scalability. This study introduces a novel semi-supervised CBM framework designed for medical imaging, which leverages dual-level hypergraph learning to model high-order concept dependencies and generate domain-adaptive pseudo-labels. Our approach achieves superior interpretability and performance by integrating a concept-level hypergraph for enhanced reasoning and an image-level hypergraph for robust pseudo-label generation. Experiments on a newly annotated PAS ultrasound dataset and a breast ultrasound public dataset demonstrate the effectiveness of the proposed concept label-efficient interpretable framework. Its universality is further validated on the dermoscopic image dataset SkinCon. The code is available at this https URL.
[CV-123] Understanding Identity Continuity in Thermal Video through Scene-Level Consistency CVPR2026 CVPR
链接: https://arxiv.org/abs/2606.01694
作者: Wei-Chieh Sun,Gyungmin Ko,Heejae Kwon,Hsiang-Wei Huang,Jenq-Neng Hwang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings
Abstract:Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.
[CV-124] RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection
链接: https://arxiv.org/abs/2606.01689
作者: Pingping Liu,Aohua Li,Yubing Lu,Jin Kuang,Tongshun Zhang,Qiuzhan Zhou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 8 figures, under review
Abstract:The detection and segmentation of infrared small targets have important application significance in the fields of surveillance and security, maritime rescue and so on. Due to the low occupancy of these targets in long-distance imaging, the mainstream visual state space model is inefficient and difficult to accurately model the target edge. The existing infrared state space models do not deviate from the mainstream visual state space structure framework from the structural properties of infrared small targets. In order to solve this problem, this paper proposes the RPCASSM network based on the model paradigm of robust principal component analysis(RPCA), which aims to design the background state space module(BSSM) and the target state space module(TSSM) by the nature of the infrared small target in the spatial domain. The BSSM aims to use the saliency of spatial heterogeneous signals to design a spatial probe scanning mechanism(SPCM) to model background information. The TSSM designs a deformable prompt scanning mechanism(DPCM) by using the sparsity and local highlight of the target to focus on the deformable space of the target for state space modeling. According to the above design, we effectively solve the problem that the existing mainstream vision state space model is difficult to accurately model the edge structure of infrared small target. Experimental results on the existing benchmark data sets prove the effectiveness of the RPCASSM design. Our code will be made public at \hrefthis https URLRPCASSM.
[CV-125] Restoring Initial Noise Sensitivity in Text-to-Image Distillation via Geometric Alignment ICML2026
链接: https://arxiv.org/abs/2606.01651
作者: Huayang Huang,Ruoyu Wang,Jinhui Zhao,Wei Deng,Daiguo Zhou,Jian Luan,Yu Wu,Ye Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026
Abstract:Generative distillation significantly accelerates text-to-image (T2I) generation by compressing multi-step trajectories into few-step student models while preserving perceptual quality. However, existing methods primarily optimize efficiency and output fidelity, often neglecting critical properties of the original trajectory. In this work, we identify a key missing property: sensitivity to initial noise, whose degradation impairs downstream control methods relying on noise-based optimization and manipulation. We trace this issue to standard distillation objectives that enforce pointwise output alignment, inadvertently flattening the input-output landscape and suppressing the teacher’s local geometric structure. To address this, we propose Geometry-Aware Distillation (GAD), a sensitivity-preserving framework that aligns the local functional behavior of teacher and student models. Specifically, GAD matches Jacobian-vector products with respect to input noise, enabling the student to reproduce the teacher’s differential response to perturbations. Extensive experiments across multiple T2I paradigms and noise-driven control tasks demonstrate that GAD significantly restores sensitivity and improves diversity while maintaining high visual fidelity. Code is available at this https URL.
[CV-126] PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation ICML2026
链接: https://arxiv.org/abs/2606.01649
作者: Weixing Chen,Zhuoqian Feng,Yang Liu,Yexin Zhang,Yifan Wen,Yinghong Liao,Weichao Qiu,Guanbin Li,Liang Lin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 5 figures, accepted by ICML 2026
Abstract:Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment directly loadable into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. It employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). It integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
[CV-127] Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument
链接: https://arxiv.org/abs/2606.01643
作者: Rui Hong,Jana Košecká
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sign Language Production (SLP) is the task of generating avatar sign language motion from natural language text. The quality of the generated motion is typically evaluated by a motion-space Fréchet distance (FID) and back-translation (BT) BLEU score on benchmarks such as How2Sign. Both metrics can improve substantially while the underlying generator fails to faithfully represent the sign language gestures. In this work we propose to evaluate the generated motion at three independent levels: (\tau1) initial-pose conditioning, (\tau2) output diversity, and (\tau3) target faithfulness. We compute these as pairwise-distance ratios using latent representations of a frozen motion autoencoder (MoAE). We evaluate 14 SLP model checkpoints on the How2Sign dataset, including a re-implemented Neural Sign Actors (NSA), and show that \tau3 faithfulness is never attained, while FID varies by nearly two orders of magnitude and is uncorrelated with faithfulness. We show that on the isolated gloss dataset ASL3DWord favorable \tau3 can be attained, hence isolating the size of the sentence-level paired-dataset as the bottleneck. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.01643 [cs.CV] (or arXiv:2606.01643v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.01643 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rui Hong [view email] [v1] Mon, 1 Jun 2026 03:50:36 UTC (994 KB) Full-text links: Access Paper: View a PDF of the paper titled Conditional Collapse in Sign Language Production: A Diagnostic and a Scaling Argument, by Rui Hong and Jana Ko\vseck’aView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-06 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[CV-128] Edge-directed geometric partitioning for versatile video coding ICME
链接: https://arxiv.org/abs/2606.01641
作者: Xuewei Meng,Xinfeng Zhang,Chuanmin Jia,Xia Li,Shanshe Wang,Siwei Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been published in IEEE ICME
Abstract:To improve the coding performance, geometric partition (GEO) was proposed for the upcoming VVC standard. GEO provides 140 partition candidates. The index of optimal GEO mode needs to be signaled explicitly. Considering different structural characteristics of different CUs and the correlation between spatial adjacent blocks and temporal collocated blocks, we propose a GEO mode prediction strategy by constructing a Most Probable Mode (MPM) list to reduce the overhead of GEO index and improve coding efficiency. Based on the observation of the high correlation between the partition mode and object boundaries, an edge-directed geometric partition scheme is proposed to construct the MPM list according to spatio-temporal edge information. The proposed method provides an objective BD-rate gain of 0.58% and 1.00% on average for RA and LDB configurations compared to VTM-6.0. Besides, it also promotes the visual quality of object boundaries.
[CV-129] CanonCGT: Reference-Based Color Grading via Canonical Pivot Representation CVPR2026
链接: https://arxiv.org/abs/2606.01638
作者: Jinwon Ko,Keunsoo Ko,Chang-Su Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 accepted
Abstract:Reference-based color grading aims to reproduce the tonal mood and lighting of a reference while preserving color harmony and scene structure. Existing photorealistic and filter-based methods often produce unstable tone mappings – over-shifting or inconsistently retaining colors – leading to unnatural results. We propose CanonCGT, a two-stage framework built on a canonical pivot – a style-neutral intermediate representation for stable color mapping. The first stage canonicalizes the input by removing intrinsic tonal bias, and the second color-grades it to match the reference style. A dual-phase training scheme, DP-CGT, combines supervised preset learning with self-supervised refinement on unpaired photographs. CanonCGT delivers photorealistic and tonally consistent results across diverse datasets, surpassing state-of-the-art methods in stability and visual fidelity. Our codes are available at \hrefthis https URLthis https URL
[CV-130] Pave-GRPO: Beyond Instantaneous Guidance through Principled Averag e Velocity Decomposition
链接: https://arxiv.org/abs/2606.01636
作者: Pengyang Ling,Jiazi Bu,Yujie Zhou,Yibin Wang,Zhenyu Hu,Zihan Zhang,Yi Jin,Huaian Chen,Yuhang Zang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages,5 figures
Abstract:Post-training via Group Relative Policy Optimization (GRPO) has emerged as a powerful paradigm for aligning flow-based generative models with human preferences. However, the iterative denoising nature of flow models incurs substantial costs when generating group rollouts for policy-gradient updates, compelling existing methods to train with extremely few denoising steps. This temporal sparsity severely restricts preference optimization: reward feedback can only reach a handful of stages per trajectory, leaving the vast majority of intermediate denoising steps without direct supervision and thus compromising alignment granularity. To address this, we propose Pave-GRPO, which reformulates the GRPO objective through Principled average velocity decomposition. Rather than generating expensive high-step rollouts, we maintain efficient few-step group sampling but decompose each coarse transition into an equivalent ensemble of finer sub-trajectories spanning multiple intermediate timesteps. This propagates reward feedback to a denser set of temporal stages for more comprehensive preference alignment without additional generation cost. This design offers two benefits: (i) zero-cost horizon expansion: through the direct reuse of piece-wise group samples and their associated rewards, Pave-GRPO significantly broadens the effective optimization scope under fixed sampling budgets; and (ii) comprehensive temporal supervision: by equivalently decomposing an instantaneous velocity target into a multi-timestep ensemble, it distributes reward signals across more intermediate stages of the denoising process, enabling finer-grained and more thorough preference optimization. Extensive experiments validate that Pave-GRPO effectively advances preference alignment across different reward settings, offering comprehensive performance enhancement.
[CV-131] What to Test Next: Interpretable Coverag e Gap Discovery in Driving VLMs
链接: https://arxiv.org/abs/2606.01624
作者: Abhishek Aich,Sparsh Garg,Vijay Kumar BG,Turgun Yusuf Kashgari,Manmohan Chandraker
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:
Abstract:Driving vision-language models (VLMs) must accurately understand scenes across diverse conditions defined by Operational Design Domains (ODDs), yet verification remains sparse: many slices are missing, making empirical failure rates unreliable. We propose SliceScorer, a deterministic scoring rule for missing-slice recommendation that combines (i) an exposure-based coverage prior to prioritize rare, under-tested regions, and (ii) a neighbor-failure prior that propagates risk from similar tested conditions. SliceScorer is deliberately simple - interpretable, auditable, and conservative - properties essential for safety-critical validation. For stress testing beyond the declared ODD, we embed SliceScorer within SliceNav, an LLM-orchestrated verification pipeline where the model interprets developer queries to select relevant operators (triage, scoring, acquisition, evaluation) and vocabulary extensions, composing verification workflows while keeping all scoring deterministic and auditable. Experiments on three driving VLMs (WiseAD, DriveMM, Cosmos-Reason2-2B) show that SliceNav surfaces high-risk coverage gaps more effectively than prior slice-discovery methods while maintaining diverse recommendations across the condition space. Ablations confirm both scoring components contribute, and qualitative analysis demonstrates end-to-end workflows from developer query to targeted evaluation.
[CV-132] Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation
链接: https://arxiv.org/abs/2606.01621
作者: Muyi Bao,Yuxin Cai,Hang Xu,Zongtai Li,Jinxi He,Jingfan Tang,Chen Lv,Ji Zhang,Yaqi Xie,Wenshan Wang
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 8 pages
Abstract:Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a visible navigable pixel to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on this http URL Page: this https URL.
[CV-133] Real-Time Generation of Streamable Talking Portrait Video with Reference-Guided Deep Compression VAEs CVPR2026
链接: https://arxiv.org/abs/2606.01620
作者: Sicheng Xu,Yu Deng,Shoukang Hu,Yichuan Wang,Yizhong Zhang,Zhan Chen,Jiaolong Yang,Baining Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 (Highlight) Camera ready
Abstract:Video diffusion models have significantly advanced portrait video generation, yet their high computational demands limit their use in interactive applications. This work presents a framework for streamable talking portrait video generation conditioned on speech audio and reference images. Designed meticulously for streaming scenarios, it features a causal video VAE for deep latent compression and an autoregressive latent denoising model. Our causal VAE integrates a variable number of reference images as guidance, allowing the network to focus on dynamic information rather than static appearance, thereby enhancing compression efficacy and reconstruction quality. Additionally, we extend the residual auto-encoding paradigm to improve spatial-temporal causality handling in our VAE. The generator is based on a Rectified Flow Transformer architecture and produces video latents in a blockwise auto-regressive manner. Our method enables the real-time generation of high-quality talking portrait videos, achieving speeds significantly faster than baseline models. Furthermore, comprehensive experiments demonstrate that it is on par with or even outperforms these large models in realism, vividness, and video quality.
[CV-134] uring Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval ACM-MM2025
链接: https://arxiv.org/abs/2606.01615
作者: Xiang Fang,Wanlong Fang,Wei Ji,Tat-Seng Chua
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Published in ACM MM 2025. Address some typos
Abstract:Video-language models are pivotal for tasks such as moment retrieval and highlight detection, yet they often struggle to capture the dynamic, non-linear interactions between temporal video sequences and textual semantics. Existing approaches, relying on static cross-attention or prompt-tuning mechanisms, fail to adaptively model the evolving relationships between modalities, leading to suboptimal alignment and limited generalization. Inspired by systems biology, we propose \textbfReaction-Diffusion Multimodal Fusion (RDMF), a novel framework that reimagines video-language alignment as a reaction-diffusion (RD) process, drawing on the principles of pattern formation introduced by Alan Turing. In RDMF, video features diffuse across time to capture temporal context, while text-video interactions are modeled as non-linear reactions that amplify relevant features and suppress noise, forming emergent patterns akin to biological systems. Leveraging the Gray-Scott RD model, we design a computationally efficient fusion module that integrates video and text representations, supported by rigorous mathematical analysis of stability and convergence using Turing instability criteria. Our framework is theoretically grounded, employing advanced mathematical tools to ensure stable pattern formation, and is practically viable, incorporating standard components like pretrained encoders and DETR-style heads for moment retrieval and saliency prediction. RDMF represents a pioneering interdisciplinary approach, bridging systems biology and multimedia research to address the limitations of conventional multimodal fusion. Preliminary experiments demonstrate its potential to outperform existing methods in identifying salient video moments, offering a new paradigm for video-language tasks.
[CV-135] Self-Improving Small Object Grounding in LVLMs
链接: https://arxiv.org/abs/2606.01612
作者: Tianze Yang,Yucheng Shi,Ruitong Sun,Ninghao Liu,Jin Sun
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 29 Pages, 15 Figures
Abstract:Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality-a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, demonstrating that useful attention structure improves both localization reliability and interpretability in LVLMs.
[CV-136] Exploiting Semantic and Pixel Representations for Ultra-Low Bitrate Image Compression
链接: https://arxiv.org/abs/2606.01608
作者: Hao Wei,Yanhui Zhou,Chenyang Ge,Saeed Anwar,Ajmal Mian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Most existing extreme compression methods fail to achieve an optimal rate-distortion-perception trade-off, as they typically prioritize perceptual fidelity and visual realism over pixel-level accuracy. Consequently, the resulting reconstructions often deviate noticeably from the originals. Ultra-low bitrate image compression is therefore crucial-not only for producing extremely compact representations but also for ensuring that reconstructed images remain semantically coherent and faithful to the source at the pixel level. To this end, we propose SPRDiff, a diffusion-based compression method that fully leverages both semantic and pixel representations, thereby enhancing reconstruction fidelity under ultra-low bitrate constraints. Specifically, we develop a triple-encoder architecture that utilizes high-fidelity features from the pretrained distortion-oriented and semantic-oriented encoders to compensate for the limited representations extracted by the frozen VAE encoder, thereby improving latent compression and entropy modeling. To further enhance the reconstruction fidelity of diffusion models, we introduce a distortion-aware reconstruction module with dual feature extraction. This module not only generates a coarse reconstruction that preserves the main structures, but also provides practical and accurate semantic- and pixel-level conditional signals to guide the diffusion model. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art approaches in the rate-distortion-perception tradeoff at extremely low bitrates (below 0.03 bpp), effectively preserving both perceptual quality and pixel-wise fidelity in the reconstructed images. We will release the source code and trained models at this https URL.
[CV-137] Paving the Way for Point Cloud Video Representation Learning Using A PDE Model
链接: https://arxiv.org/abs/2606.01604
作者: Zhuoxu Huang,Zhenkun Fan,Jungong Han,Josef Kittler
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) in 2026
Abstract:Investigating spatial-temporal correlations, specifically how spatial points vary over time, is crucial for understanding point cloud videos. Traditional methods, particularly flow-based techniques, struggle with these correlations due to the unordered spatial arrangement of sequential point cloud data. To address this challenge, we propose a novel approach that regularizes spatial-temporal correlation learning by formulating the problem as a solvable Partial Differential Equation (PDE). While PDEs have long been effective in the physical domain, their application to novel sequential data like point cloud video remains underexplored. Inspired by fluid analysis, we construct a simplified PDE, and the process of solving PDE is guided and refined by a contrastive learning structure between the temporal embeddings and the spatial embeddings. With this extra supervision, our method, named MotionPDE, serves as an effective, plug-and-play enhancement module for existing backbone models, adding minimal computational overhead and parameters. Capitalizing on the contrastive learning process, we delve deeper into the self-supervised capabilities of MotionPDE, yielding promising results that underscore its utility and adaptability in point cloud video data interpretation. The code repo with trained checkpoints will be available at this https URL for facilitating future research.
[CV-138] EIVE: End-to-End Instance-Specific Visual Explanations for Detection Transformers
链接: https://arxiv.org/abs/2606.01601
作者: Jianlin Xiang,Yanshan Li,Linhui Dai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures
Abstract:Visual explainability for object detection remains challenging due to the multi-instance nature of detection. Existing approaches predominantly adopt post-hoc paradigms, such as gradient-based or perturbation-based explanation methods, to interpret pretrained detectors. However, these methods require additional gradient computation or repeated model inference, resulting in limited efficiency. To address this issue, we propose an End-to-end Instance-specific Visual Explanation framework (EIVE) that directly generates instance-level saliency maps following the forward pass of Detection Transformer (DETR)-like models. Specifically, we reformulate the cross-attention mechanism in the decoder as an instance-level feature attribution pathway, so that the cross-attention of each object query corresponds to the visual attribution of its predicted instance. Based on this formulation, we design a cross-layer hybrid consensus fusion (CLHCF) module to aggregate cross-attention signals across decoder layers, producing stable and compact explanations. The explanation process of EIVE requires neither gradient computation nor input perturbation, yielding high computational efficiency, and applies to single- and multi-scale DETR-like object detectors. Finally, we present an attention-aware joint training strategy (AAJTS) as a training-oriented application, which imposes spatial constraints on cross-attention patterns to encourage stable and concentrated attribution representations, thereby improving both interpretability and detection performance. Experiments on MS COCO 2017, ExDark, and Cityscapes demonstrate that EIVE produces high-quality instance-level saliency maps and achieves performance comparable to, or better than, state-of-the-art post-hoc methods across standard metrics, while substantially improving explanation efficiency. Code is available at this https URL.
[CV-139] LG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning
链接: https://arxiv.org/abs/2606.01591
作者: Ali Alavi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The TimeLogic Challenge evaluates formal temporal-logic reasoning over video - 16 operators (before, after, until, since, always, co-occur, ordering, …) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video’s action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, a +24.5 absolute gain, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations - not larger models - drive accuracy.
[CV-140] Effective Multi-sensor Conditioning for Street-view Novel-view Synthesis
链接: https://arxiv.org/abs/2606.01590
作者: Zhengfei Kuang,Adam Sun,Liyuan Zhu,Tong Wu,Shengqu Cai,Jonathan Tremblay,Iro Armeni,Ehsan Adeli,Lior Yariv,Gordon Wetzstein
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Modern vehicle platforms are equipped with a rich sensor suite, including LiDAR, calibrated multi-camera rigs, and accurate ego-motion, that in principle offers strong signal for re-rendering a driving scene from novel viewpoints. A growing line of recent work leverages video diffusion models for this task, using their generative priors to synthesize plausible novel views from sparse vehicle observations. In practice, however, existing methods exploit only a fragment of this signal, and their quality tends to degrade as the target trajectory departs from the recorded driving path. We argue that this is fundamentally a multi-sensor fusion problem: sparse LiDAR reprojections supply accurate but incomplete metric geometry, surround-view reference imagery supplies dense appearance but no metric depth, and camera poses tie the two together across views. We introduce StreetNVS, a video diffusion framework that jointly conditions on all three signals through a Reference-Enhanced Camera Attention module based on a relative ray-level positional encoding. We develop a two-stage curriculum training strategy that gradually exposes the model to increasingly sparse LiDAR. On the Waymo Open Dataset, StreetNVS substantially outperforms state-of-the-art baselines under sparse LiDAR conditioning, matches methods that rely on 10-100 times denser point clouds. We further show capabilities of synthesizing coherent videos along extreme out-of-trajectory paths such as elevation, lane-shift, pullback, and rotation. Our website: this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR) Cite as: arXiv:2606.01590 [cs.CV] (or arXiv:2606.01590v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.01590 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-141] FLAME: Physics-Guided Neural Operators for Onboard Satellite Methane Detection in Hyperspectral Imagery
链接: https://arxiv.org/abs/2606.01577
作者: Junhyuk Heo,Junhwan Park,Sancheol Sim,Beomkyu Choi,Woojin Cho
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Methane is a major driver of near-term climate change, and rapidly identifying its emission sources is a critical climate intervention. Spaceborne hyperspectral imagery is the primary tool for this task, but the volume of data produced by each sensor makes ground-based detection impractical and necessitates onboard detection. Classical methods incur prohibitive computational cost on onboard hardware, while deep learning models are fast but fall short on detection quality. We propose FLAME, a physics-guided neural operator that builds the physics of methane absorption directly into its architecture. On the methane detection benchmark, FLAME achieves the highest detection accuracy among all evaluated methods, reduces the pixel-level false positive rate by nearly 3\times over the strongest neural baseline, uses the fewest parameters among learned baselines, and runs within the latency budget of onboard satellite hardware.
[CV-142] Deformable Wiener Filter for Future Video Coding
链接: https://arxiv.org/abs/2606.01576
作者: Xuewei Meng,Chuanmin Jia,Xinfeng Zhang,Shanshe Wang,Siwei Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper has been published in IEEE Transactions on Image Processing
Abstract:In-loop filters have attracted increasing attention due to the remarkable noise-reduction capability in the hybrid video coding framework. However, the existing in-loop filters in Versatile Video Coding (VVC) mainly take advantage of the image local similarity. Although some non-local based in-loop filters can make up for this shortcoming, the widely-used unsupervised parameter estimation method by non-local filters limits the performance. In view of this, we propose a deformable Wiener Filter (DWF). It combines the local and non-local characteristics and supervisedly trains the filter coefficients based on the Wiener Filter theory. In the filtering process, local adjacent samples and non-local similar samples are first derived for each sample of interest. Then the to-be-filtered samples are classified into specific groups based on the patch level noise and sample-level characteristics. Samples in each group share the same filter coefficients. After that, the local and non-local reference samples are adaptively fused based on the classification results. Finally, the filtering operation with outlier data constraints is conducted for each to-be-filtered sample. Moreover, the performance of the proposed DWF is analyzed with different reference sample derivation schemes in detail. Simulation results show that the proposed approach achieves 1.16%, 1.92%, and 2.67% bit-rate savings on average compared to the VTM-11.0 for All Intra, Random Access, and Low-Delay B configurations, respectively.
[CV-143] textVG2GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer
链接: https://arxiv.org/abs/2606.01573
作者: Yibin Zhao,Yihan Pan,Jun Nan,Wenli Yang,Liwei Chen,Jianjun Yi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose \textVG^2 GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. \textVG^2 GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables \textVG^2 GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. \textVG^2 GT outperforms current state-of-the-art methods on widely used DTU, Replica, TAT, and ScanNet datasets.
[CV-144] Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation NEURIPS2025
链接: https://arxiv.org/abs/2606.01565
作者: Xiang Fang,Wanlong Fang,Changshuo Wang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Published in NeurIPS 2025, address some typos
Abstract:Vision-Language Navigation in Continuous Environments (VLN-CE) poses a formidable challenge for autonomous agents, requiring seamless integration of natural language instructions and visual observations to navigate complex 3D indoor spaces. Existing approaches often falter in long-horizon tasks due to limited scene understanding, inefficient planning, and lack of robust decision-making frameworks. We introduce the \textbfHierarchical Semantic-Augmented Navigation (HSAN) framework, a groundbreaking approach that redefines VLN-CE through three synergistic innovations. First, HSAN constructs a dynamic hierarchical semantic scene graph, leveraging vision-language models to capture multi-level environmental representations, from objects to regions to zones, enabling nuanced spatial reasoning. Second, it employs an optimal transport-based topological planner, grounded in Kantorovich’s duality, to select long-term goals by balancing semantic relevance and spatial accessibility with theoretical guarantees of optimality. Third, a graph-aware reinforcement learning policy ensures precise low-level control, navigating subgoals while robustly avoiding obstacles. By integrating spectral graph theory, optimal transport, and advanced multi-modal learning, HSAN addresses the shortcomings of static maps and heuristic planners prevalent in prior work. Extensive experiments on multiple challenging VLN-CE datasets demonstrate that HSAN achieves state-of-the-art performance, with significant improvements in navigation success and generalization to unseen environments.
[CV-145] Attention-guided Fine-tuning of Multimodal Large Language Models Improves Chain-of-Thought Reasoning
链接: https://arxiv.org/abs/2606.01558
作者: Sanchit Sinha,Guangzhi Xiong,Bohan Liu,Zhenghao He,Aidong Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.
[CV-146] ForestMamba: Sparse Mamba with Geometry-guided Queries for 3D Forest Point Cloud Segmentation
链接: https://arxiv.org/abs/2606.01549
作者: Trung Thanh Nguyen,Tuan-Anh Vu,Duc Viet Le,Yasutomo Kawanishi,Takahiro Komamizu,Ichiro Ide,Teja Kattenborn
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI-based semantic and instance segmentation of terrestrial and drone LiDAR point clouds is emerging as a transformative approach for converting the complex 3D structure of forests into actionable information for forest monitoring and biodiversity assessment. However, forest LiDAR scenes remain highly challenging due to their large data volumes, irregular sampling density, overlapping and complex canopy structure, and geographic variability. Existing methods based on sparse convolutions or Transformers achieve promising results, but suffer from two key limitations: Quadratic complexity of attention scales poorly to large forest scenes, and Generic context modeling does not exploit forest structural priors, limiting tree separation in complex regions. To address these challenges, we propose ForestMamba, a structure-aware method that incorporates forest-specific priors into feature encoding, query generation, and query refinement, while replacing quadratic attention with linear-time state-space modeling. First, we introduce a sparse encoder with vertical-priority slab serialization that organizes sparse voxels into vertically coherent sequences for efficient long-range context modeling. Second, we propose a geometry-guided query initialization strategy based on an on-the-fly multi-scale Canopy Height Model (CHM), where canopy maxima provide ecologically meaningful query seeds, supplemented by Farthest Point Sampling (FPS) to cover understory trees. Third, we design a Mamba-based query decoder that combines local kNN voxel aggregation with a spatial dual-path Mamba for query refinement with linear computational complexity. Extensive experiments across seven forest regions demonstrate that ForestMamba consistently outperforms existing baselines in both segmentation tasks, while achieving 3 times faster inference and 2.3 times lower GPU memory than Transformer-based methods.
[CV-147] PathAR: Structure-First Autoregressive Synthesis of Multimodal Pathology Images
链接: https://arxiv.org/abs/2606.01543
作者: Yuan Zhang,Jiahao Xia,Junzhang Huang,Meng Wang,Feng Chen,Guanyu Yang,Huazhu Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 7 figures
Abstract:Data scarcity in multimodal pathology motivates unified generative models that synthesize modality-specific appearance while preserving anatomically coherent structure. Although modalities differ in appearance statistics, morphological structures such as cellular topology and tissue boundaries are largely preserved across acquisition protocols. However, existing methods often model these factors within a homogeneous token stream, implicitly coupling structure with appearance and weakening structural controllability under modality shifts. To address this, we propose pathology Autorgressive modeling (PathAR), a structure-first autoregressive synthesis framework that explicitly factorizes structure and appearance for modality-label-conditioned pathology this http URL employs a dual vector quantization (Dual-VQ) tokenizer to decompose samples into mask-grounded structure and appearance tokens, and an interleaved autoregressive (IAR) transformer with asymmetric attention visibility to enforce structure-to-appearance dependence. PathAR stabilizes morphology under heterogeneous modality-specific appearances and enables spatially aligned image–mask pair generation. Extensive experiments show that PathAR improves structural consistency and modality fidelity over baselines, maintains sample diversity, supports downstream segmentation in data-scarce regimes, and demonstrates extensibility to finer-grained intra-modality organ-label variation.
[CV-148] MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics
链接: https://arxiv.org/abs/2606.01538
作者: Žiga Kovačič,Kevin Ellis
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 13 figures. Project page: this https URL
Abstract:To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.
[CV-149] PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder ICML2026
链接: https://arxiv.org/abs/2606.01537
作者: Yancheng Liu,Kenichi Maeda,Manan Pancholy
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at the ICML 2026 3rd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS)
Abstract:Clinical diagnosis often requires combining imaging with physiological measurements, yet deployed models typically operate on unimodal data. We present PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray (CXR) encoders while remaining strictly unimodal at inference. PaCX-MAE augments in-domain masked autoencoding with a dual contrastive-predictive objective, aligning CXR representations with paired ECG and laboratory embeddings. Extensive evaluation across nine benchmarks demonstrates consistent improvements over domain-specific MAE, particularly on physiology-dependent tasks (e.g., +2.7 AUROC on MedMod; +6.5 F1 on VinDr). The method proves highly label-efficient in the 1% regime and preserves anatomical fidelity, achieving parity with MAE on segmentation tasks. Zero-shot and attention analyses confirm that PaCX-MAE successfully learns to attend to physiological indicators, such as the cardiac silhouette, absent in standard visual pretraining.
[CV-150] MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes
链接: https://arxiv.org/abs/2606.01518
作者: Ye Tao,Yuxin Yao,Kendong Liu,Dapeng Wu,Junhui Hou
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: 18 pages, 7 figures
Abstract:Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.
[CV-151] Splatshot: 3D Face Avatar Generation from a Single Unconstrained Photo
链接: https://arxiv.org/abs/2606.01493
作者: Hao Liang,Zhixuan Ge,Soumendu Majee,Joanna Li,Ashok Veeraraghavan,Guha Balakrishnan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 15 figures
Abstract:Reconstructing a photorealistic 3D face avatar from a single unconstrained photograph is challenging: feed-forward 3D Gaussian Splatting (3DGS) models degrade on out-of-distribution inputs, while pretrained diffusion models produce high-fidelity images but lack multi-view consistency. We observe that these paradigms are fundamentally complementary: explicit 3D representations guarantee geometric consistency, whereas 2D diffusion priors ensure photorealism. Building on this, we propose SplatShot, a training-free framework that couples these representations directly within the denoising process. Given a base 3DGS face model and a single reference image, we jointly denoise all target views using a per-step 3D feedback loop. At each timestep, we predict clean images from the noisy latents, refit the 3DGS to these multi-view predictions, and back-propagate the photometric discrepancy between the 3DGS re-renderings and 2D predictions into the noise estimate. This steers the sampling trajectory toward strictly 3D-coherent, identity-faithful outputs. Experiments on diverse in-the-wild images demonstrate that SplatShot produces 3D avatars with superior identity preservation, photorealism, and multi-view consistency.
[CV-152] Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering
链接: https://arxiv.org/abs/2606.01485
作者: Ali Alavi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We describe our submission to the VRR Challenge @ CVPR 2026, built on the \emphImplicitQA / \emphVRR-QA benchmark~\citeimplicitqa: multiple-choice video question answering in which answers are deliberately \emphnot observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL~\citeqwen25vl, Qwen3-VL~\citeqwen3vl, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1~\citevideor1 and VideoChat-R1.5~\citevideochatr15) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency~\citeselfconsistency, multi-model ensembling, and category routing). Our central finding is that this benchmark is \emphperception-bound rather than reasoning-bound: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception – relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved – and a prompt that explicitly injects monocular depth cues to attack the weakest category \emphlowers test accuracy by 5.8 points, confirming that the model needs a better \emphpercept, not a better \emphprocedure.
[CV-153] SafeGen-Bench: Benchmarking Safety in Image-Conditioned Text-to-Video Generation
链接: https://arxiv.org/abs/2606.01481
作者: Yingzi Ma,Xiaogeng Liu,Yawen Zheng,Chaowei Xiao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 7 figures, 2 tables
Abstract:With the rapid advancements in text-to-image diffusion models, generative video models (T2V models) like Sora can now produce short synthetic videos from a text prompt or an initial image. However, synthetic video generation – especially when guided by an initial image – often poses risks, including the potential creation of illegal, politically sensitive, or unethical content. Existing benchmarks have started to consider the safety of generated videos, but they primarily focus on testing models with malicious text prompts, ignoring the scenario where text prompt and image combination may still lead to harmful video content. In practice, this is a common and challenging issue: videos generated from safe text and image inputs can nonetheless convey harmful information. To bridge this gap, we introduce SafeGen-Bench, a benchmark specifically designed to evaluate the safety of conditional T2V models. Our benchmark defines 10 malicious categories, concentrating on risks related to both temporal sequences and depicted behaviors. SafeGen-Bench consists of carefully selected start frames from diverse image and video sources, paired with corresponding text prompts to simulate realistic inputs. We evaluate a variety of conditional T2V models on SafeGen-Bench, and the results indicate that current models struggle to consistently avoid generating malicious content with unsafety scores reaching up to 44.5, especially under conditions requiring high quality. Furthermore, we assess the effectiveness of both text-based and image-based guardrails on our benchmark, finding that unimodal guardrails alone were insufficient to provide a robust defense, with an 80% failure rate across seven malicious categories. We hope that SafeGen-Bench will foster the development of safer and more controllable conditional T2V models.
[CV-154] UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures
链接: https://arxiv.org/abs/2606.01443
作者: Triet M. Le
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses this by enforcing an isotropic Gaussian target on the embeddings via Sketched Isotropic Gaussian Regularization (SIGReg). This target is in tension with the manifold hypothesis, which expects embeddings to concentrate on a low-dimensional subset of the ambient space. We propose \emphUR-JEPA, which targets a uniformly n -rectifiable measure of local tangent dimension n at small scales, realized through a Gaussian-kernel smoothed Carleson-type square function \mathcalL^\textCGLT , with a complementary Jones \beta -number formulation. On Inet10, UR-JEPA( \mathcalL^\textCGLT ) attains 0.9141 \pm 0.0014 for a +0.83 ,pp gain over LeJEPA( \mathcalL^\textSIGReg ) with \sim 30% lower seed standard deviation; on matched-recipe Galaxy10~SDSS, a single-seed ImageNet- 100 run, and a 3 -seed EuroSAT remote-sensing run, the two methods lie in the same peak-accuracy band at convergence, with UR-JEPA retaining its lower-seed-variance signature. On EuroSAT the in-domain pair is competitive at 96.0 to 96.1% with large remote-sensing foundation-model transfer at a 25\times smaller backbone. The distinction is geometric: direct visualization of the projector output distribution shows that on all four datasets UR–JEPA( \mathcalL^\textCGLT ) produces a global PCA spectrum with a 4 to 5 order-of-magnitude drop at index \sim 20 to 25 out of D = 32 , while LeJEPA’s spectrum is near-flat (top-to-bottom ratio at most 3.6 ). Per-dimension marginals are simultaneously near-Gaussian for both methods (mean Shapiro-Wilk W \in [0.992, 0.996] ) as a Diaconis-Freedman consequence. At matched accuracy the two regularizers therefore yield structurally distinct projected representations.
[CV-155] DENSER: Depth-Guided Ensemble with Staged EFA-GS Reconstruction for Soccer Novel View Synthesis CVPR2026 SOCC
链接: https://arxiv.org/abs/2606.01419
作者: Parthsarthi Rawat
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 SoccerNet Novel View Synthesis Challenge, Rank 1
Abstract:We propose DENSER, a Depth-guided ENSemble with Staged EFA-GS Reconstruction for soccer novel view synthesis. DENSER extends EFA-GS with three key contributions: (1) camera-height-based loss weighting that prioritises ground-level broadcast views, (2) monocular depth supervision from Depth-Anything-V2 to regularise geometry in textureless regions, and (3) a three-model pixel-average ensemble whose members diverge from a shared base checkpoint by varying training length and Gaussian scale clamping. On five held-out challenge scenes we achieve a mean PSNR of 29.89 dB, SSIM of 0.791, and LPIPS of 0.366.
[CV-156] Agent Skills Should Go Beyond Text: The Case for Visual Skills
链接: https://arxiv.org/abs/2606.01414
作者: Binxiao Xu,Ruichuan An,Bocheng Zou,Hang Hua
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reusable skills are a key mechanism for extending agent capabilities, allowing agents to accumulate experience and solve increasingly complex tasks. Yet most existing skill-learning methods store reusable experience as text-only assets, such as instructions, reasoning traces, or summarized trajectories. We argue that this text-only paradigm creates a fundamental bottleneck for visual-centric tasks, where reusable knowledge often depends on spatial layout, visual grounding, fine-grained appearance, and localized state changes. To address this limitation, we propose \textbf\NAME, a multimodal skill paradigm that combines declarative textual logic with explicit visual support. We distinguish three reusable forms: static priors for stable spatial conventions, dynamic priors for in-situ visual working memory, and interleaved visual skills that bind ordered text steps to the source frames, screenshots, or page regions that justify them. Rather than only describing what to do, visual skills also encode where to look, how to inspect, and how to verify visual outcomes. To scale visual-skill construction, we introduce \textbf\SYSTEM, an automatic system that converts agent experience into reusable multimodal skills by preserving textual reasoning, spatial references, visual boundaries, and interaction patterns from task trajectories. Experiments on GUI and other visual-centric tasks show that visual skills consistently outperform text-only skills, particularly when success requires spatial correspondence, visual evidence, and state-aware interaction. These results support our central position: reusable agent skills should go beyond text and become multimodal assets for future multimodal agents.
[CV-157] PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion
链接: https://arxiv.org/abs/2606.01399
作者: Heyuan Gao,Bangxun Tang,Yiren Song,Guian Fang,Zijian He,Jie Yang,Mike Zheng Shou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present PAI-Studio, a new reference-conditioned video synthesis task that addresses a long-standing challenge in cinematic background replacement: generating dynamic backgrounds aligned with foreground motion while preserving foreground identity, matching reference scene appearance, and achieving globally consistent illumination with realistic foreground relighting. Existing open-source systems and commercial APIs cannot simultaneously ensure motion-consistent background generation, high-fidelity foreground relighting and foreground identity preservation, often resulting in static backgrounds, inconsistent boundaries, and noticeable compositing artifacts. To bridge this gap, we build upon a Diffusion Transformer video backbone and reformulate the problem as an in-context conditional generation task. Through bidirectional attention, our model jointly captures foreground dynamics and background reference information within a unified architecture. We further construct a 30K-scale dataset sourced from high-quality films and online videos to support this task. Extensive evaluations demonstrate that our method significantly outperforms existing open-source and commercial API solutions.
[CV-158] raining-free image inversion for one-step diffusion models
链接: https://arxiv.org/abs/2606.01380
作者: Tao Wu,Senmao Li,Yaxing Wang,Shiqi Yang,Kai Wang,Joost van de Weijer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to Pattern Recognition
Abstract:In this work, we introduce a novel training-free inversion (TFinv) framework for one-step diffusion models,addressing key challenges in real image inversion and editing. We first identify two critical factors hamperingreal-image inversion and editing: (1) Initial Latent Editability, which is related to the distance between theinitial noise and the ideal Gaussian distribution, and (2) Caption Gap, which means the alignment betweentext captions and image representations. Both factors influence inversion efficiency and the editability ofone-step diffusion models. Then, we propose two novel techniques: iterative noise alignment (iterNA), whichminimizes the distribution gap to align with the normal Gaussian distribution, and suffix learning (suffL),which enhances text-to-image caption alignment by introducing learned suffix prompt tokens. These techniquesenable precise inversion of input images into their initial noise representations and facilitate image this http URL, we propose a mask-based editing technique for localized edits while preserving backgroundintegrity. Comprehensive experiments on the PIE-Bench dataset validate that our method TFinv not onlyachieves state-of-the-art performance in one-step diffusion editing, but also significantly outperforms existingmultistep approaches in efficiency. The code is available at this https URL.
[CV-159] BRo-JEPA: Learning Modular Arithmetic in Latent Space
链接: https://arxiv.org/abs/2606.01372
作者: Divyansh Jha,Yuanfang Xie,Varan Mehra,Brennen Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 14 figures
Abstract:Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as states and modular arithmetic operations as actions in a JEPA-style latent world model. Standard supervised baselines and JEPA models with additive operation embeddings fit seen operations but fail to extrapolate reliably to unseen ones. To bridge this gap, we introduce a block-rotation predictor that imposes the circular structure of modulo-10 arithmetic in latent space. This enables strong zero-shot generalization, with the best ResNet-based JEPA block-rotation model achieving 99.46% zero-shot and 99.46% rollout accuracy. Our results suggest that latent world models can learn symbolic transformation rules when architecture matches the structure of the problem. Our code can be \hrefthis https URLaccessed here.
[CV-160] ActMVS: Active Scene Reconstruction with Monocular Multi-View Stereo ICRA2026
链接: https://arxiv.org/abs/2606.01367
作者: Guo Pu,Yixuan Han,Zhouhui Lian
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: ICRA 2026
Abstract:Active scene reconstruction enables robots/UAVs to autonomously plan trajectories and reconstruct environments without costly manual data acquisition. Unlike passive methods, active reconstruction requires real-time construction of high-confidence occupancy maps for collision-free navigation. Existing approaches rely on depth sensors for occupancy map updates, increasing platform cost and weight. To advance spatial intelligence, we aim for a vision-only monocular solution. However, current monocular scene reconstruction methods operate offline and fail to deliver globally consistent dense depth at the frame rates required for robots/UAVs navigation. To bridge this gap, we introduce ActMVS, the first framework for monocular active reconstruction. Our framework integrates a view factor graph construction for informed Multi-View Stereo depth prediction, along with a global depth optimization, to enable the online generation of high-quality, globally consistent dense depth maps. This enables monocular robots/UAVs to maintain reliable occupancy maps for safe trajectory planning during reconstruction. Experiments on Replica datasets demonstrate performance competitive with RGB-D methods. Our code and data are available at this https URL.
[CV-161] AlbedoEdit: Unified Instance-Level Video Editing with Albedo Guidance
链接: https://arxiv.org/abs/2606.01362
作者: Xilong Zhou,Bao-Huy Nguyen,Zheng Zeng,Jacob Munkberg,Jon Hasselgren,Thomas Leimkühler,Nima Kalantari,Miloš Hašan,Christian Theobalt
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Video generative models have achieved remarkable progress in synthesizing photorealistic video sequences. However, enabling broader and more creative downstream applications requires fine-grained instance-level video editing, including object insertion, object removal, and texture editing, which has emerged as a prominent yet challenging problem. Existing approaches either propose unified generative frameworks with only coarse semantic control, or design task-specific frameworks for individual editing tasks, limiting their flexibility and applicability across diverse real-world scenarios. To address these limitations, we propose AlbedoEdit, a unified generative video editing framework that jointly supports object insertion, object removal, and texture editing. Our key insight is that the intrinsic albedo map, which is invariant to lighting and contains no specularity, shadowing and inter-reflection effects, provides an effective and user-friendly mechanism for specifying fine-grained appearance edits. Built upon video foundation models, AlbedoEdit is fine-tuned to translate source RGB videos into edited RGB videos, conditioned on a user-edited first-frame albedo. Trained on a new paired synthetic dataset covering all three editing tasks, AlbedoEdit implicitly learns to harmonize edited contents and simulate complex real-world visual effects triggered by editing operations, including specular highlights, soft shadows, and mirror reflections. AlbedoEdit demonstrates superior performance over state-of-the-art video editing approaches, both qualitatively and quantitatively. Project webpage is this https URL.
[CV-162] Diamonds in the Sky: Pareidolic Animals in Clouds
链接: https://arxiv.org/abs/2606.01361
作者: Miriam Horovicz,Yacov Hel-Or,Yael Moses
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:People often see animal shapes in clouds, a phenomenon known as pareidolia. We propose an AI-based method that aims to predict which animals people are likely to perceive in clouds, even though state-of-the-art recognition methods typically fail to detect such animals. Additionally, we introduce a method to assist individuals in perceiving specific pareidolic animals, even if they did not recognize them initially. Our approach uses a diffusion model to transform cloud segments into an animal shape that visually resemble the original cloud. This diffusion technique is inspired by the observation that the diffusion process succeeds only when the target animal resembles the shape of the cloud, and that subtle visual hints often suffice to help individuals recognize specific pareidolic animals. A generated image, successfully derived from the diffusion model, is then used to predict the pareidolic animal. Additionally, a short morphing video transitioning from the generated image back to the original cloud segment is employed to further enhance the human’s perception of the pareidolic animals.
[CV-163] ChartArena: Benchmarking Chart Parsing across Languages Scenarios and Formats
链接: https://arxiv.org/abs/2606.01348
作者: Shangpin Peng,Gengluo Li,Xingyu Wan,Chengquan Zhang,Hao Feng,Binghong Wu,Huawen Shen,Weinong Wang,Ziyi Cai,Zhuotao Tian,Han Hu,Can Ma,Yu Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Charts are a primary medium for conveying quantitative and relational information, yet systematically evaluating chart parsing models remains difficult. Existing benchmarks focus on narrow chart types and leave diagrammatic structures such as flowcharts and mind maps largely unaddressed, while models produce outputs in incompatible formats, and datasets rarely include the printed or hand-drawn images encountered in practice. To address these issues, we introduce ChartArena, a comprehensive bilingual benchmark covering eight chart families spanning both numeric charts and diagrammatic structures, each evaluated across three visual scenarios: digital renderings, printed photos, and hand-drawn photos. The dataset is built via a human-agent collaborative annotation pipeline with multi-stage human verification to ensure annotation reliability. To enable fair cross-model comparison, we further design a format-agnostic evaluation protocol that maps heterogeneous outputs into two canonical semantic spaces, a normalized triple view and a directed graph view, and scores them with structure-aware metrics. Through extensive evaluation of 26 leading MLLMs, we observe three consistent findings: (i) frontier proprietary models such as Gemini 3.1 Pro lead overall, yet the strongest open-source systems are rapidly closing the gap; (ii) document parsing models handle numeric charts reasonably but fall sharply behind on diagrammatic structures; and (iii) expert chart parsers remain limited to narrow chart families. Across all models, radar charts and hand-drawn scenarios stay especially challenging. These findings show that ChartArena exposes clear capability gaps and provides a unified foundation for future progress. ChartArena is publicly available at this https URL.
[CV-164] HOLA: Holistic Multi-Modal Alignment for Open-Set 3D Recognition
链接: https://arxiv.org/abs/2606.01334
作者: Koby Aharonov,Oren Shrout,Ayellet Tal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-set 3D recognition requires models that generalize to rare or unseen categories. Recent approaches address this by distilling language-vision knowledge into 3D encoders, typically relying on heavy 2D ViTs and aligning each point cloud with a single image or caption, thus anchoring representations to partial views. We propose aligning each point cloud with multiple images and textual descriptions to capture a more holistic understanding of 3D objects. To realize this idea, it is essential to design a loss function capable of jointly aligning a 3D instance with multiple matched signals, multi-view images and multiple texts, while separating positive aggregation from negative competition. We introduce such a function, termed the decoupled multi-positive contrastive loss. Our formulation enhances the loss’s hardness-aware focus on challenging negatives, avoiding the “spotlight crowding” that occurs when many positives share the same softmax with all the negatives. Complementing this, we present a lightweight text adapter applied only to web captions, reducing the domain gap to curated annotations and enabling effective use of large-scale unsupervised text. Our model demonstrates state-of-the-art open-vocabulary performance on long-tail benchmarks, yielding substantial zero-shot improvements while sustaining high frame rates.
[CV-165] DeblurNVS: Geometric Latent Diffusion for Novel View Synthesis from Sparse Motion-Blurred Images
链接: https://arxiv.org/abs/2606.01315
作者: Changyue Shi,Wangbo Yu,Chaoran Feng,Li Yuan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Novel view synthesis (NVS) is a fundamental problem in computer vision and graphics. Recent advances in neural radiance fields (NeRF), 3D Gaussian Splatting (3DGS), and generative view synthesis have substantially improved its quality. Yet most methods still rely on clean observations, where image structures and cross-view geometric cues are well preserved. Motion blur breaks this assumption by corrupting local details and weakening multi-view correspondences. Such blur commonly arises from camera shake, scene motion, or finite exposure in practical capture. Blur-aware NVS methods address this degradation by modeling image formation, but their reliance on costly per-scene optimization limits efficient and generalizable sparse-view synthesis. To address this, we propose DeblurNVS, a novel framework for synthesizing high-fidelity novel views directly from sparse motion-blurred images, without requiring per-scene optimization. DeblurNVS restores the intermediate geometric representations needed for multi-view reasoning, enabling blurred inputs to recover reliable structure and correspondence cues. The restored representations are then combined with target camera information to synthesize the target-view representation and reconstruct a sharp RGB novel view. To enable the large-scale training, we construct a motion-blurred NVS dataset from DL3DV-10K using interpolation-based finite-exposure blur synthesis. Extensive experiments demonstrate that DeblurNVS outperforms existing baselines on synthetic motion-blur benchmarks and generalizes to real motion-blurred scenes, producing perceptually sharper and structurally more stable novel views while avoiding costly per-scene optimization. Project page: this https URL.
[CV-166] Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning
链接: https://arxiv.org/abs/2606.01287
作者: Garvin Guo,Yu Chen,Xiang Wang,Shuai Li,Xinpei Zhao,Huaxing Liu,Shuai Dong
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent latent visual reasoning methods achieve substantial gains by inserting continuous latent tokens into multimodal language models. These gains are commonly attributed to the tokens encoding visual evidence; recent analyses, however, reveal a paradox: the tokens are loosely tied to the image and contribute little to the answer. Critically, these analyses treat latent tokens as a single unit, obscuring the true source of the gains. We therefore decompose latent tokens into three testable components: latent slots, boundary markers, and format, and develop a state-of-the-art method as a probe under favorable conditions. Across six method-stage settings and four perception-heavy benchmarks, latent slots fail every prediction of the visual-memory account. Strikingly, retaining only the boundary markers preserves 78 to 100% of the gain in several settings, while the model attends to the image more narrowly at latent positions than at answer positions. The gain therefore comes from boundary markers, format, and this attention pattern, not from latent slots. How each method engages this mechanism depends on its training supervision: at matched accuracy, mechanisms can still differ markedly. Latent visual reasoning thus needs evaluation not only by accuracy but by what the model actually relies on.
[CV-167] Knowledge-Intensive Video Generation
链接: https://arxiv.org/abs/2606.01285
作者: Chenxu Wang,Mingda Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Text-to-video generation has advanced rapidly in visual quality, but remains under-evaluated for factuality and practical usefulness. We introduce knowledge-intensive video generation (KIVI), where models generate videos from short information-seeking prompts that ask for explanations, procedures, or demonstrations. To evaluate this setting, we construct KIVI-Bench, a benchmark of 1,080 prompts, and propose automatic metrics for factuality and helpfulness. Human evaluation shows that our metrics significantly better align with human annotations than existing alternatives. Experiments on seven state-of-the-art video generation models show that current systems still lag behind human performance, especially on visual properties, procedural operations, and clear information presentation. These results highlight KIVI as a challenging direction for factual and instructionally useful video generation.
[CV-168] KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation
链接: https://arxiv.org/abs/2606.01282
作者: Farbod Davoodi,Seyed Reza Tavakoli Shiyadeh,Pooria Safaei,Sana Harighi,Parsa Gholami,Amirali Amini,Kimia Vanaei,Emad Firoozi,Parham Abed Azad,Babak Khalaj,Siavash Ahmadi,Amir Hossein Payberah,Mohammad Hossein Rohban,Soheil Kolouri,Ali Diba
类目: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Text-to-Image (TTI) systems are now everyday infrastructure for journalism, education, advertising, and public communication, and the demographic and cultural stereotypes they inherit from training data (rendering women, people of colour, older adults, and non-Western cultures as under-represented or caricatured) become a population-level harm at deployment scale. Existing mitigations either require costly retraining, infeasible for the closed-source backbones that dominate consumer products, or rely on fixed demographic templates that ignore cultural context. We present KG-FairDiff, a model-agnostic, inference-time framework that formalises fairness-aware prompt refinement as a constrained optimisation problem and operationalises it as a closed-loop pipeline: a knowledge graph of ~1,200 culture- and bias-related triples retrieves structured context, an LLM rewriter proposes refinements, and a validator accepts only prompts that reduce a divergence-based fairness loss while preserving semantic fidelity to the user’s original intent. We prove a finite-termination bound for the refinement loop, contribute a mathematically consistent evaluation suite linking Bias-P/Bias-W to divergence from target distributions and ENS to KL divergence, and audit eight widely-deployed backbone generators. KG-FairDiff substantially reduces gender, race, age, and intersectional disparities while preserving prompt semantics, offering a practical, deployment-ready route to more equitable generative AI.
[CV-169] Event-Based Vision in Space: Applications Trends and Future Directions MICRO
链接: https://arxiv.org/abs/2606.01280
作者: Luigi Capogrosso,Pietro Bonazzi,Michele Magno
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the XXIV Annual Conference on Sensors and Microsystems (AISEM) 2026
Abstract:Earth Observation (EO) is undergoing a significant transformation driven by the deployment of novel sensing technologies. Traditional frame-based optical sensors often struggle with motion blur, high power consumption, and extreme data redundancy in challenging orbital environments. In contrast, event-based sensors, also known as neuromorphic cameras, offer a bio-inspired asynchronous approach. By capturing only local illumination changes, they provide microsecond temporal resolution, an extremely high dynamic range, and exceptional energy efficiency. Although the use of these sensors is rapidly expanding from terrestrial systems to orbital platforms, the scientific literature surrounding their space-based applications remains heavily fragmented. To bridge this gap, this article presents a comprehensive review of the state-of-the-art in event-based vision in the space domain. Based on the retrieved literature, we introduce a taxonomy structured around four primary domains: 1) atmospheric and high-speed observation; 2) environmental monitoring and change detection; 3) operational support and onboard processing; and 4) geospatial modeling and predictive analysis. As a result, this survey highlights that neuromorphic engineering is far more than a supplementary imaging technique; it is a paradigm shift that can be used to directly address critical bottlenecks in modern remote sensing and sustainable space exploration.
[CV-170] DeepIPCv3: Event-Aware Multi-Modal Sensor Fusion for Sudden Pedestrian Crossing Avoidance
链接: https://arxiv.org/abs/2606.01277
作者: Oskar Natan,Andi Dharmawan,Aufaclav Zatu Kusuma Frisky,Jazi Eko Istiyanto,Jun Miura
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
备注:
Abstract:Current end-to-end autonomous driving systems predominantly rely on frame-based sensors, which suffer from inherent perception latency and motion blur during highly dynamic encounters, specifically sudden pedestrian crossings. To address this critical safety vulnerability, we propose DeepIPCv3, a novel multi-modal autonomous navigation framework that synergizes the dense 3D spatial geometry of LiDAR point clouds with the microsecond-level asynchronous event streams of a Dynamic Vision Sensor (DVS). We introduce a Transformer-inspired cross-modal attention mechanism to dynamically correlate these distinct modalities, allowing the network to instantaneously prioritize high-speed dynamic updates without sacrificing structural scene awareness. The fused latent representations are then mapped to safe local waypoints and executable control commands via a hybrid policy network that blends heuristic trajectory tracking with direct neural predictions. Due to the severe physical risks associated with live testing of these sudden crossing scenarios, the framework is rigorously evaluated offline using a custom multi-modal dataset collected across both well-illuminated noon and challenging evening conditions. Extensive comparative and ablation studies demonstrate that DeepIPCv3 achieves state-of-the-art predictive performance. By effectively eliminating exposure failures and motion blur, the proposed LiDAR and DVS fusion yields the lowest trajectory and control command errors, enabling highly reactive, mathematically bounded evasive maneuvers regardless of ambient illumination. To support future research, we will release the codes to our GitHub repo at this https URL.
[CV-171] Exploiting In-Sensor Computing for Energy-Efficient Earth Observation MICRO
链接: https://arxiv.org/abs/2606.01271
作者: Luigi Capogrosso,Pietro Bonazzi,Loris Hoxhaj,Michele Magno
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at the XXIV Annual Conference on Sensors and Microsystems (AISEM) 2026
Abstract:The rapid growth of the satellite industry has driven a significant increase in geospatial data acquisition, highlighting a critical bottleneck: the severe disparity between the volume of collected sensor data and the limited downlink bandwidth available to ground stations. While On-Board Computing (OBC) has helped address this by pre-processing data in orbit, this article further advances the paradigm by introducing an in-sensor computing framework. We present an optimized end-to-end Earth Observation (EO) pipeline tailored for strict computational constraints by integrating TinyML techniques with the Sony IMX500 Intelligent Vision Sensor. Specifically, our approach shifts processing directly to the sensor level, offloading the computation from the primary embedded device, and effectively mitigating the downlink transmission of noisy or irrelevant data. We evaluated several efficient Convolutional Neural Networks (ConvNets), i.e., SqueezeNet, ShuffleNetV2, and MCUNetV1, on the EuroSAT dataset. Experimental results show that, despite the optimizations required for deployment on the IMX500 platform, our models maintain a competitive 96.68% accuracy while operating within its 8 MB constraints. Specifically, the models reach an average processing throughput of 17.40 FPS with a latency of 27.43 ms. Furthermore, our system profile exhibits high energy efficiency, with a low energy footprint of 14.19 mJ per inference and an efficiency rating of 42.26 GMAC/J, demonstrating its viability for in-sensor deployment.
[CV-172] Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
链接: https://arxiv.org/abs/2606.01247
作者: Liyang Li,Muzhi Zhu,Zhiyue Zhao,Hengyu Zhao,Ke Liu,Linhao Zhong,Hao Chen,Chunhua Shen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Humans can reproduce the viewpoint specified by a target image through active head and body motion, yet spatial intelligence in foundation models has largely been studied as passive understanding of pre-collected observations. We introduce Target Viewpoint Reproduction (TVR) – an active task where an agent adjusts its viewpoint in a 3D environment until its observation matches a given target image – and TVRBench, an indoor-simulation benchmark spanning scene scale and target-view visual richness. TVR is far from solved: on the evaluation split, the strongest open-source and closed-source models reach only 7.8% and 12.0% success. Fine-grained analysis identifies two consistent bottlenecks: off-the-shelf models struggle with multi-turn visual history, and performance drops sharply when viewpoint reproduction requires body translation rather than in-place rotation, exposing a gap in mapping spatial discrepancies to embodied movement. To study reducing this gap, we build a unified TVR post-training framework covering expert-trajectory SFT, rationale-supervised CoT-SFT, offline Single-turn GRPO, and on-policy Multi-turn GRPO from live simulator rollouts. Visual-action SFT supplies the main gain, raising a 9B open-source model to 50.8% success; Multi-turn GRPO provides targeted multi-room refinement and reaches 51.4% overall, while CoT supervision and Single-turn GRPO degrade closed-loop performance. These results establish TVRBench as a testbed for measuring and training foundation models that actively perceive and act in 3D environments. Our code, data, and models are available at this https URL.
[CV-173] Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers
链接: https://arxiv.org/abs/2606.01217
作者: Aadithya Prabha Ramaharsha,Deevna Reddy,Uma Ranjan
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Applications (stat.AP)
备注: Third International Conference Biomedical Engineering Science and technology
Abstract:Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by challenges in communication and behavior. This study examines the relationship between ethnicity and ASD traits, along with behavioural scores, sex and neonatal jaundice across three ethnic groups: White Europeans, Asians, and Middle Eastern individuals. We perform a logistic regression and show that ethnicity has a significant effect on incidence of ASD. White Europeans are 81% increased risk of ASD and Middle Easterners are at 79% reduced risk of ASD compared to Asians. We also confirm earlier studied which show that neonatal jaundice is a significant predictor of ASD, while male children are at much higher risk of ASD compared to female children. These results suggest the need for diagnostic frameworks and interventions that account for ethnic in the presentation and assessment of ASD traits
[CV-174] Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning
链接: https://arxiv.org/abs/2606.01207
作者: Zhiqiang Zhou,Xuezhen Xie
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 8 pages,6 figures,4 tables
Abstract:The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation’s sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation’s advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.
[CV-175] PairedGTA: Generating Driving Datasets for Controlled Photometric Shift Analysis
链接: https://arxiv.org/abs/2606.01192
作者: Andrea Chianese,Giulio Rossolini,Alessandro Biondi,Marco Cococcioni,Giorgio Buttazzo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Evaluating the performance of visual perception systems for autonomous driving is essential to ensure reliable operation across diverse environmental scenarios. Ideally, a balanced and fair analysis across different adverse conditions would require perfectly paired images of the same scene under different weather or illumination changes. This would allow evaluating the effect of photometric shifts independently of geometry and semantic changes. Unfortunately, real-world datasets rarely provide images of the same scene under different environmental conditions, because, normally, camera pose, traffic, and locations of dynamic objects (vehicles, pedestrians, etc.) vary over time, thus yielding only coarsely paired data. To address this challenge, this work introduces a data generation framework based on a high-fidelity game engine for extracting perfectly paired images. By leveraging software APIs that communicate with the GTA game engine, the framework modifies illumination and weather conditions while preserving scene geometry, camera pose, and the identity and placement of dynamic objects. For each sampled location, it procedurally instantiates dynamic entities and renders pixel-aligned images under diverse adverse conditions. The benefit of the proposed generation framework in driving scenarios is demonstrated through a systematic analysis of semantic segmentation models, whose output degradation can be attributed more directly to photometric shifts rather than to uncontrolled semantic or geometric factors.
[CV-176] Reusing Fusion-Time Spectral Reliability for Adaptive Fusion and Expert Routing in RGB-Infrared Object Detection
链接: https://arxiv.org/abs/2606.01173
作者: Yefeng Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:RGB-infrared detectors typically discard the statistics generated during cross-modal fusion, leaving downstream modules unaware of whether the current interaction is reliable. We propose to extract a parameter-free, 7-dimensional spectral reliability descriptor – summarizing band energy, amplitude ratio, phase consistency, and cross-modal correlation – and to reuse it beyond the fusion stage. The descriptor drives both Spectral Reliability Fusion (SRF), which gates a spectral residual against a conservative spatial base, and Reliability-Conditioned Expert Routing (RCER), which combines the descriptor with pooled content to steer sparse post-fusion experts. Under matched ablations, descriptor-aware gating improves mAP50 over content-only adaptive gating; a 2\times2 factorial analysis further shows that descriptor-conditioned routing provides the larger marginal gain over expert architecture alone at near-equal parameter count. Under six synthetic degradations on DroneVehicle, average retention rises to 95.0%, versus 92.0% for content-only MoE and 87.9% for concatenation, with the largest gain under modality drop; the same model also improves mAP50 by +5.2/+5.3 on the natural day/night split. These results suggest that preserving fusion-time reliability as an explicit signal benefits both adaptive fusion and post-fusion conditional computation.
[CV-177] owards Interactive Video World Modeling: Frontiers Challenges Benchmarks and Future Trends
链接: https://arxiv.org/abs/2606.01164
作者: Jiuming Liu,Chaojun Ni,Mengmeng Liu,Chensheng Peng,Fangjinhua Wang,Sitian Shen,Marc Pollefeys,Masayoshi Tomizuka,Ayush Tewari,Per Ola Kristensson
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. The GitHub repository is publicly available at: this https URL
Abstract:With rapid development of large language models and diffusion-based content generation, world modeling has attracted increasing research attention, benefiting various downstream domains such as game engines, embodied AI, autonomous driving, etc. Through explicitly incorporating user actions into world state transition, recent literature empowers world modeling with interactivity in an action-conditioned video or 3D generation paradigm, further enhancing controllability over world evolutions and facilitating users to freely traverse, manipulate, navigate, and personalize the state evolution. In this paper, we aim to systematically review recent research trends, technical developments, evaluation benchmarks, and also propose future potential directions in interactive world modeling. Specifically, we first summarize recent efforts and trends in terms of application scenarios, world state evolution, and scene modality. Afterwards, we delve into three crucial technical challenges, including action-conditioned controllability, long-horizon interactions and memory, and action-following responsiveness for real-time interactivity. Furthermore, we also thoroughly compare existing benchmarks and metrics in four specific application fields: open-world exploration, game engine, autonomous driving, and robotics. Finally, we discuss several promising future directions in achieving next-generation interactive world modeling. The corresponding repository is publicly available at: this https URL.
[CV-178] HiTokSR: A Coarse-to-Fine Tokenizer with Hierarchical Codebooks for High-Fidelity Real-World Image Super-Resolution
链接: https://arxiv.org/abs/2606.01157
作者: Mingxi Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vector-quantized (VQ) generative models have shown promising results in real-world image super-resolution (Real-ISR). However, existing methods typically rely on a monolithic latent space that entangles low-frequency structures with high-frequency textures. This entanglement forces a single codebook to capture a combinatorially complex set of structure-texture pairings, which constrains representational capacity and limits codebook utilization. To address this issue, we present HiTokSR, a hierarchical token prediction framework. Instead of using a single codebook, HiTokSR partitions the latent space along the channel dimension into frequency-aware groups, quantizing each with an independent sub-codebook. This coarse-to-fine design disentangles global structures from fine details, enhancing combinatorial expressiveness while circumventing the optimization instability of high-dimensional nearest-neighbor lookups. To further improve semantic consistency, our generator integrates priors from a vision foundation model via adaptive feature modulation, multi-scale class tokens, and a representation alignment loss. Additionally, we introduce an index-level perturbation strategy during decoder fine-tuning to bridge the train-test discrepancy in discrete token prediction. Extensive experiments on real-world benchmarks demonstrate that HiTokSR achieves state-of-the-art performance in both perceptual quality and reconstruction fidelity.
[CV-179] CoSTL: Comprehensive Spatial-Temporal Representation Learning for Moment Retrieval and Highlight Detection
链接: https://arxiv.org/abs/2606.01149
作者: Xin Dong,Wenjia Geng,Wenfeng Deng,Yansong Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 3 figures
Abstract:Video Moment Retrieval (MR) and Highlight Detection (HD) are crucial tasks in video analysis that aim to localize specific moments and estimate clip-wise relevance based on a given text query. Recent approaches treat them as similar video grounding tasks and use the same architecture to solve them. These tasks require both fine-grained comprehension at the image level and high-level temporal understanding across the entire video. Existing approaches have primarily focused on temporal modeling using frame-level features, often neglecting the rich visual information related to the text query within individual frames. This oversight leads to inaccurate grounding results. To address this limitation, we propose a Comprehensive Spatial-Temporal Representation Learning Framework (CoSTL), which captures both fine-grained image-level information and temporal dynamics. Specifically, CoSTL incorporates a text-driven progressive fine-grained image encoder, performing a two-step text-driven knowledge extraction process to learn fine-grained spatial representations. Furthermore, a multi-scale temporal perception module captures comprehensive spatial-temporal representations, enhancing the model’s ability to process temporal dynamics. We demonstrate state-of-the-art performance on four public benchmarks: QVHighlights, Charades-STA, TACoS, and TVSum.
[CV-180] HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers
链接: https://arxiv.org/abs/2606.01132
作者: Issa Sugiura,Shuhei Kurita,Yusuke Oda,Naoaki Okazaki
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 16 pages, 17 figures
Abstract:Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.
[CV-181] STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing
链接: https://arxiv.org/abs/2606.01126
作者: Shir Maon,Odelia Melamed,Adi Shamir
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model’s accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network’s internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.
[CV-182] Rank-Aware Quantile Activation for Motion-Robust Crop Segmentation in UAV Imagery
链接: https://arxiv.org/abs/2606.01118
作者: Abinav Kiran,Sravan Danda,Aditya Challa,Sougata Sen,Daya Sagar B S
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Motion blur from high-speed UAV acquisition de-grades semantic segmentation on rare texture-dependent classes with high agronomic value. Standard CNNs rely on high-frequency magnitude features that blur destroys, causing statistical erasure of minority signals. We propose Dual Quantile Activation (QAct), a rank-aware block replacing magnitude gating with instance-level rank normalization. Evaluated onAgriculture-Vision 2021 across zero-shot and blur-supervised regimes at multiple severities, QAct is the dominant architectural factor: it delivers consistent mIoU gains over ReLU across both regimes and all severities, with strongest gains on rare structural and texture-dependent classes. Some dominant classes (water,planter skip) show mixed per-class performance under distillation. At moderate blur, zero-shot QAct outperforms distillation-trained ReLU; across all severities, Distill-QAct achieves best performance, confirming rank aware activation and blur-domain training are complementary robustness sources.
[CV-183] R3: Composed Video Retrieval via Reasoning -Guided Recalling and Re-ranking
链接: https://arxiv.org/abs/2606.01113
作者: Zixu Li,Yupeng Hu,Zhiheng Fu,Zhiwei Chen,Weili Guan,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present \mathbbR^3 , a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. At last, a re-ranker verifies the recalled candidates with direct source-candidate comparison. Experiments have demonstrated the effectiveness of our method in addressing this challenge. Codes are available on this https URL.
[CV-184] mporal Evidence Routing with Structured Visual Evidence for TimeLogicQA
链接: https://arxiv.org/abs/2606.01106
作者: Yuyang Sun,Yongliang Wu,Xingyu Zhu,Yuxia Chen,Zhenxiang Jiang,Yangguang Ji,Wenbo Zhu,Yanxi Shi,Jay Wu,Shuo Wang,Xu Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:TimeLogicQA evaluates whether video question answering systems can reason over temporal relations such as event existence, ordering, persistence, boundary conditions, and overlap. We address this task with a visual evidence routing pipeline that separates perception from symbolic temporal reasoning. The system first parses each question into event targets, answer mode, candidate options, and temporal operators. It then routes videos according to duration and operator difficulty, using ordered full-frame evidence for short clips and event-focused candidate windows for long videos. A multimodal large language model produces structured visual evidence for the relevant events, while programmatic verifiers recover dense action intervals and a deterministic reducer applies operator-specific temporal rules to produce the final answer. Conservative fusion accepts an answer only when the visual evidence, temporal program, and confidence checks agree, reducing noisy answer flips. On the official test evaluation, our final system achieves an AvgAcc of 81.8.
[CV-185] Adaptive Dense Evidence Refinement for Video Relational Reasoning for VRR-QA Challenge
链接: https://arxiv.org/abs/2606.01104
作者: Yuyang Sun,Yongliang Wu,Xingyu Zhu,Yuxia Chen,Zhenxiang Jiang,Yangguang Ji,Wenbo Zhu,Yanxi Shi,Jay Wu,Shuo Wang,Xu Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:VRR-QA evaluates whether video-language systems can infer spatial, temporal, viewpoint, depth, and visibility relations that are not always resolved by a single frame. We present an inference-only system built around adaptive test-time computation. The system first answers each question with a direct video-language model pass, then uses multiple lightweight views to find unstable questions. Only these difficult questions are routed to a high-budget dense evidence module that constructs timestamped frame observations, relation-specific probes, candidate verification, and conservative temporal aggregation. This design separates two problems that are often confused in video question answering: finding plausible alternative answers and deciding when a current answer should actually be changed. On the test split, the final system obtains 90.07 average accuracy and 87.81 macro average accuracy. The report focuses on the final test system and the implementation settings required to reproduce the adaptive dense verifier.
[CV-186] Dual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R
链接: https://arxiv.org/abs/2606.01097
作者: Yuyang Sun,Yongliang Wu,Xingyu Zhu,Yuxia Chen,Zhenxiang Jiang,Yangguang Ji,Wenbo Zhu,Yanxi Shi,Jay Wu,Shuo Wang,Xu Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We describe \emphDual-Route Top-K Retrieval with 1v1 VLM Reranking for the CoVR-R challenge. The method treats composed video retrieval as two coupled problems: finding a sufficiently complete top-k candidate set, and then safely deciding whether any candidate should replace a strong current top-1. We first improve the reasoning/text seed with a VLM slot selector over existing candidates, without introducing DFN visual retrieval. We then add a visual route from contact-sheet embeddings using DFN-H/DFN-L. The routes are merged into a top-10 candidate set, after which a VLM final reranker performs conservative 1v1 comparisons between the current top-1 and each challenger. On the hidden test split, the final system reaches 95.28 R@1, 97.47 R@5, 98.48 R@10, and 99.66 R@50. The main lesson is that CoVR-R benefits more from recall-selection decoupling than from broad text reranking or direct multi-candidate VLM classification.
[CV-187] Chameleon: Style-Content Disentangled Framework for Cross-Domain Object Compositing
链接: https://arxiv.org/abs/2606.01079
作者: Sukhun Ko,Soo Ye Kim,Jihyong Oh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The last two authors are co-corresponding authors. Please visit our project page at this https URL
Abstract:Image compositing aims to seamlessly insert a foreground object into a background image, and recent advances in diffusion models have significantly enhanced the quality, especially when the foreground and background images come from the same domain (e.g., natural images). However, cross-domain compositing, where the foreground and background come from different domains, is relatively underexplored and remains challenging because the model must preserve the foreground object’s identity while stylizing it to match the background domain. Existing cross-domain compositing approaches largely rely on training-free blending and refinement strategies. This is partly due to the lack of large-scale paired datasets for cross-domain compositing, limiting the development of training-based solutions. As a result, they are limited to tone-level alignment and often produce style-inconsistent or overstylized results. To overcome such limitations, we construct ChameleonDataset, the first large-scale training dataset for cross-domain compositing, with a comprehensive evaluation benchmark, built through a scalable data construction pipeline. Building on this, we propose Chameleon, a novel two-stage training-based cross-domain compositing framework. In the first stage, we propose Joint Hard Contrastive Learning (JHCL) to train ChameleonEncoder, which effectively disentangles style and content representations. In the second stage, we introduce Spatio-Temporal Attention Gating (STAG) into a diffusion transformer for effective stylization, adaptively regulating how style tokens from the first-stage encoder are injected across spatial and temporal dimensions. Our method outperforms state-of-the-art in-domain and cross-domain compositing models, sequential pipelines and commercial models, achieving improvements in both compositional plausibility and stylistic fidelity.
[CV-188] Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs
链接: https://arxiv.org/abs/2606.01072
作者: Jianing Qian,Qinhe Peng,Emmanuel Panov,Leonor Fermoselle,Dinesh Jayaraman,Bernadette Bucher,Tarik Kelestemur
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
[CV-189] A Multiscale Network with Supervised Contrastive Learning for Real-Time Facial Emotion Recognition
链接: https://arxiv.org/abs/2606.01069
作者: Rejoy Chakraborty,Archisman Adhikary,Chayan Halder,Payel Rakshit,Sanchita Ghosh,Kaushik Roy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages
Abstract:Real-time emotion recognition from facial expressions is a challenging task, particularly in video-based scenarios where multiple emotional states may occur over time. The difficulty increases further due to the fact that each emotional state is associated with facial expressions that vary significantly across individuals. The change of facial expressions portraying emotional state is not discrete, but rather continuous, which is very challenging to represent through computational aids. A system with the ability to detect variations in facial expressions can have a significant impact on determining the emotional state of an individual. Such a system can be very beneficial for psychologists during counseling by providing additional insights into the emotional state of a subject. In this paper, a deep learning-based system is presented to detect emotional changes in real-time video of a person by modeling the change in facial expressions. The current study is conducted on a standard dataset for training of the deep learning system and the system has provided very satisfactory outcomes in this respect.
[CV-190] 3DCodeBench: Benchmarking Agent ic Procedural 3D Modeling Via Code WWW
链接: https://arxiv.org/abs/2606.01057
作者: Yipeng Gao,Lei Shu,Genzhi Ye,Xi Xiong,Ameesh Makadia,Meiqi Guo,Laurent Itti,Jindong Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
备注: Project Page: this https URL 11 pages (main), with appendix
Abstract:Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.
[CV-191] xtFake: Benchmarking AI-Generated Image Detection on Text-Rich Images
链接: https://arxiv.org/abs/2606.01050
作者: Yuning Zhang,Changtao Miao,Mingyu Liao,Tingyu Liu,Xinghao Wang,Tao Gong,Qi Chu,Nenghai Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent AI-generated image (AIGI) detectors perform well on natural-image benchmarks, but their behavior on text-rich forgeries, such as fabricated screenshots, documents, and news pages prevalent in misinformation, remains untested. We introduce TextFake, a 20,000-image benchmark for text-rich AIGI detection spanning 28 languages, 4 topic categories, and 2 scene modalities. Fake images are synthesized via a four-stage pipeline that annotates real images along three controlled dimensions and generates counterparts through distribution-aligned structured prompting, ruling out covariate shortcuts. Zero-shot evaluation of 14 specialized detectors and 3 frontier VLM APIs reveals a large systematic gap: no method exceeds 80% accuracy, with some dropping over 60% from natural-image benchmarks. Diagnostic evaluations identify three failure modes: the Text Density Curse, where dense glyphs overwhelm low-level detectors; Cloaking via Rendering Fidelity, where stronger text rendering suppresses enerative artifacts; and Threshold Collapse, where routine perturbations drive detectors toward chance-level performance.
[CV-192] Decoupled Residual Denoising Diffusion Models for Unified and Data Efficient Image-to-Image Translation CVPR2026
链接: https://arxiv.org/abs/2606.01048
作者: Ziyue Lin,Jiahe Hou,Hongyu Xia,Xinrui Xie,Feifei Wang,Yuyin Zhou,Wei Wang,Jiawei Liu,Liangqiong Qu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026
Abstract:We propose Decoupled Residual Denoising Diffusion models (DRDD) for unified and data-efficient image-to-image (I2I) translation. While diffusion models have advanced I2I translation in terms of quality and diversity, we uncover a previously under-explored property in diffusion models. Crucially, beyond its conventional role of manifold lifting (i.e., moving data off low-dimensional manifolds), injecting Gaussian noise facilitates domain harmonization by implicitly aligning feature distributions across domains, a property particularly advantageous for unified I2I translation. However, existing diffusion models prematurely erode this harmonization effect, as noise and residuals are simultaneously removed in a single coupled diffusion process. To address this, DRDD decouples the diffusion process into two sequential and independent diffusion stages: (1) a stochastic noise diffusion for domain harmonization and manifold lifting, and (2) a deterministic residual diffusion that learns the core semantic mapping entirely within the fixed-noise domain. This decoupling preserves harmonization and manifold lifting effects throughout the transformation, substantially simplifying the learning of unified mappings across diverse tasks and domains. Notably, the noise diffusion stage is trained exclusively on abundant, unpaired target-domain images, greatly improving data efficiency. Comprehensive theoretical and empirical analysis demonstrates that DRDD is broadly compatible with mainstream diffusion models and consistently delivers robust, unified I2I translation, even under limited paired data. Our code is available at this https URL.
[CV-193] Ask4VG: Risk-Aware Question Selection for Reducing Prior-Driven Answers in Medical VQA
链接: https://arxiv.org/abs/2606.01044
作者: Xiaorong Zhu,Qiang Li,Zibo Xu,Weijie Wang,Weizhi Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Medical visual question answering requires models to ground their responses in image evidence, because visually unsupported answers can mislead downstream interpretation. However, many medical VQA questions are generic, template-like, or highly similar in form, which can encourage models to learn question-answer shortcuts instead of image-dependent reasoning and thereby increase the risk of hallucinated responses. We propose Ask4VG, a label-free pilot framework for risk-aware question selection. Ask4VG estimates question-induced hallucination risk through counterfactual visual probing: the same question is asked under the original image, a perturbed image, a blank image, and a mismatched image, and the resulting answer relations are converted into weak supervision for a counterfactual risk estimator. The learned estimator then reranks candidate question rewrites to favor intent-preserving questions that are less invariant to missing or mismatched visual evidence before final answer generation. On VQA-RAD with Qwen2-VL-2B-Instruct, prompt-only rewriting increases counterfactual risk, whereas predicted-risk reranking reduces held-out risk from 0.658 to 0.623 and improves exact accuracy from 0.337 to 0.356. A 300-sample PMC-VQA external check shows the same direction of risk reduction with a small accuracy gain. These results suggest that question selection is a promising complement to response-level hallucination mitigation for reliable medical VQA.
[CV-194] mporally-Aligned Evaluation for Audio-Driven Talking Head Generation
链接: https://arxiv.org/abs/2606.01031
作者: Zhicheng Zhang,Lei Wang,Yu Zhang,Yongsheng Gao
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: Research report
Abstract:Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.
[CV-195] Data Collection for Training Quality-Control AI in Carpet Manufacturing
链接: https://arxiv.org/abs/2606.01023
作者: Akbar Erkinov
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures
Abstract:Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inconsistent at the line speeds and widths of modern looms. We present a design proposal for an in-line machine-vision system whose primary purpose is twofold: to inspect the carpet web in real time and, equally importantly, to systematically collect and label images of defect patterns so that increasingly capable quality-control models can be trained over the life of the this http URL proposal is grounded in a concrete industrial setting: a Six Sigma (DMAIC) project at a woven-carpet production facility that anticipated a production bottleneck following the installation of additional weaving machines, with a substantial baseline defect rate and significant financial exposure associated with quality failures. We describe an imaging subsystem based on synchronized line-scan cameras with combined bright-field and grazing illumination, derive the resolution and throughput requirements needed to resolve fine structural defects across a multi-metre web, and define a carpet-specific defect this http URL then lay out a staged modelling strategy that begins with unsupervised anomaly detection trained on defect-free material, following the paradigm exemplified by the carpet category of the MVTec Anomaly Detection benchmark, and matures through a human-in-the-loop annotation flywheel into supervised detection and segmentation models. Finally, we connect detection performance to the DMAIC objectives, showing how reductions in escaped defects translate into improved process quality and process sigma levels. The contribution is an end-to-end, deployable blueprint that treats data collection as a first-class engineering objective rather than an afterthought.
[CV-196] ProductWebGen: Benchmarking Multimodal Product Webpage Generation KDD2026
链接: https://arxiv.org/abs/2606.01022
作者: Zhihong Liu,Siqi Kou,Zheng Li,Ye Ma,Quan Chen,Peng Jiang,Kai Yu,Zhijie Deng
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026
Abstract:Crafting a product display webpage from a source product image, along with layout and visual content instructions, holds significant practical value for domains such as marketing, advertising, and E-commerce. Intuitively, this task demands strict visual consistency across product displays and high-fidelity instruction following to jointly generate renderable HTML code. These requirements on controllability and instruction-following are closely aligned with the core features of advanced multimodal generative models, such as image editing models and unified models. To this end, this paper introduces ProductWebGen to systematically benchmark the product webpage generation capacities of these models. We organize ProductWebGen with 500 test samples covering 13 product categories; each sample consists of a source image, a visual content instruction, and a webpage instruction. The task is to generate a product showcase webpage including multiple consistent images in accordance with the source image and instructions. Given the mixed-modality input-output nature of the task, we design and systematically compare two workflows for evaluation – one uses large language models and image editing models to separately generate HTML code and images (editing-based), while the other relies on a single UM to generate both, with image generation conditioned on the preceding multimodal context (UM-based). Empirical results show that editing-based approaches achieve leading results in webpage instruction following and content appeal, while UM-based ones may display more advantages in fulfilling visual content instructions. We also construct a supervised fine-tuning dataset, ProductWebGen-1k, with 1,000 groups of real product images and LLM-generated HTML code. We verify its effectiveness on the open-source UM BAGEL. The data and code are available at this https URL.
[CV-197] Learning Neural Deformation Representation for 4D Dynamic Shape Generation ECCV2024
链接: https://arxiv.org/abs/2606.01021
作者: Gyojin Han,Jiwan Hur,Jaehyun Choi,Junmo Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2024
Abstract:Recent developments in 3D shape representation opened new possibilities for generating detailed 3D shapes. Despite these advances, there are few studies dealing with the generation of 4D dynamic shapes that have the form of 3D objects deforming over time. To bridge this gap, we focus on generating 4D dynamic shapes with an emphasis on both generation quality and efficiency in this paper. HyperDiffusion, a previous work on 4D generation, proposed a method of directly generating the weight parameters of 4D occupancy fields but suffered from low temporal consistency and slow rendering speed due to motion representation that is not separated from the shape representation of 4D occupancy fields. Therefore, we propose a new neural deformation representation and combine it with conditional neural signed distance fields to design a 4D representation architecture in which the motion latent space is disentangled from the shape latent space. The proposed deformation representation, which works by predicting skinning weights and rigid transformations for multiple parts, also has advantages over the deformation modules of existing 4D representations in understanding the structure of shapes. In addition, we design a training process of a diffusion model that utilizes the shape and motion features that are extracted by our 4D representation as data points. The results of unconditional generation, conditional generation, and motion retargeting experiments demonstrate that our method not only shows better performance than previous works in 4D dynamic shape generation but also has various potential applications.
[CV-198] Cross-Axis Feature Fusion with Joint-Wise Motion Difference Prediction for Text-Based 3D Human Motion Editing CVPR2026
链接: https://arxiv.org/abs/2606.01014
作者: Gyojin Han,Junmo Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: CVPR 2026
Abstract:We address text-based 3D human motion editing, where the goal is to preserve the style and structure of a source motion while applying edits described in natural language. The release of the MotionFix dataset has spurred active research into training-based diffusion models that directly generate an edited motion from a source motion and a text instruction. While previous works have focused primarily on learning when an edit should occur temporally, our goal is to create a model that understands not only this temporal aspect but also which specific joints are responsible for the change. Targeting this, we propose a novel architecture and a complementary auxiliary task to aid its training. Our architecture consists of two axis-anchored transformers, which extract distinct features along the joint and time dimensions respectively, and a cross-axis fusion block that integrates these representations. We further introduce an auxiliary task that trains the joint-anchored transformer to regress the Soft-DTW distance between source and target joint rotations. This objective teaches the module to understand which joints to modify and which to preserve. Through comprehensive experiments on the MotionFix dataset, we demonstrate that our method significantly improves semantic alignment with both the text instruction and the source motion, as well as the overall fidelity of the generated motion, achieving state-of-the-art results.
[CV-199] Automated Erythrocyte Detection and Tracking for Retinal Blood Flow Quantification in Erythrocyte-Mediated Angiography
链接: https://arxiv.org/abs/2606.01006
作者: Chiao-Yi Wang,Havish S Gadde,Yi-Ting Shen,Saige M. Oechsli,Osamah Saeedi,Yang Tao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Capillary-level retinal blood flow (RBF) has strong potential as a biomarker for various ocular diseases. However, modalities for measuring capillary-level RBF remain limited. Erythrocyte-mediated angiography (EMA), an emerging imaging technique, enables capillary-level RBF measurement by visualizing individual erythrocytes, yet automated erythrocyte detection and tracking, which are essential for quantifying blood flow, remain largely unexplored. To address this gap, we propose EMTrack, a novel framework featuring a flow-context module for erythrocyte detection that distinguishes moving from paused cells and a topology-aware tracking strategy that enables tracking under large inter-frame displacements and substantial motion variations. In addition, we establish RBF-EMA, a new EMA dataset with comprehensive erythrocyte detection and tracking annotations. Experimental results demonstrate that our method outperforms baseline methods both quantitatively and qualitatively on detection and tracking tasks in the RBF-EMA dataset. Moreover, RBF quantification results highlight the strong potential of our framework for automated retinal blood flow measurement.
[CV-200] SWARD: Stochastic Window-Attention-Based Relational Distillation for Cross-Architectural Semantic Segmentation
链接: https://arxiv.org/abs/2606.00999
作者: Aditya Makineni,Qing Tian
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale vision foundation models have driven substantial gains on dense prediction tasks such as semantic segmentation, but their size makes deployment impractical in resource-constrained settings, motivating knowledge distillation as a means of transferring their capabilities to lightweight student networks. However, modern foundation teachers are predominantly transformer-based that encode global context, whereas efficient students are typically convolutional networks with locally biased receptive fields. Existing distillation methods largely assume architectural homogeneity and rely on direct feature mimicry, which fails to bridge this representational gap and neglects the structured spatial dependencies and discriminative organization required for accurate semantic segmentation. In this paper, we propose SWARD, a knowledge distillation framework that addresses this gap through two complementary mechanisms. First, we introduce a Multi-Scale Windowed Attention Distillation (MWAD) module that aligns teacher-student attention-based relations within stochastically shifted window partitions whose offsets are randomly resampled at every training iteration. This removes window boundary bias, and, combined with the multi-scale design, captures both short- and long-range spatial dependencies. Second, we introduce Prototype Discriminative Regularization (PDR), a loss that helps shape the student’s feature distribution by enforcing inter-class separation and intra-class compactness, further sharpening the discriminative structure beyond what feature mimicry alone can produce under the student’s reduced capacity. Experiments across different vision applications (i.e., urban scene parsing and medical image segmentation) show that SWARD achieves state-of-the-art performance.
[CV-201] An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
链接: https://arxiv.org/abs/2606.00987
作者: Bingyu Li,Da Zhang,Tao Huo,Zhiyuan Zhao,Junyu Gao,Xuelong Li
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Vision-Language Models (LVLMs) have shown strong visual understanding and language-guided grounding abilities, yet their capacity for multi-temporal visual reasoning remains underexplored. To bridge this gap, we introduce \textbfMulti-temporal Referring Segmentation (MTRS), a new task that aims to segment language-described temporal changes from multi-temporal images. MTRS extends conventional referring segmentation and change detection by jointly requiring temporal correspondence reasoning, language grounding, and pixel-level mask prediction. We propose \textbfCRAFT-Agent, an automated data construction pipeline with human auditing, and build \textbfMTRefSeg-21K, the first MTRS benchmark, containing 21K high-quality multi-temporal image-text-mask triplets across diverse scenes, viewpoints, and domains. Benchmarking a broad set of VLM- and LVLM-based models reveals that direct inference performs poorly, while task-specific fine-tuning remains limited. To address this, we propose \textbfMTRefSeg-R1, a change-aware LVLM framework trained with a two-stage strategy. It first learns general temporal-change perception from 20K vision-only bi-temporal samples, and is then fine-tuned on MTRefSeg-21K for fine-grained language-guided temporal localization. MTRefSeg-R1 explicitly models cross-temporal visual differences, aligns language instructions with temporal variations, and predicts referred change masks. Extensive experiments show that MTRefSeg-R1 achieves strong and often superior performance compared with existing LVLM baselines, demonstrating the challenge and potential of MTRS.
[CV-202] Flexible Control of 3D CT Generation via Text and Semantically-Defined Segmentation Prompts
链接: https://arxiv.org/abs/2606.00967
作者: Weicheng Dai,Chenyu Wang,Andy Li,Shantanu Ghosh,Kayhan Batmanghelich
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative models for volumetric medical images have found many applications in medical imaging, ranging from data augmentation to serving as priors for inverse problems. For these applications, generating high-resolution 3D images with strong controllability is essential but remains highly challenging. Existing approaches typically control generation either through radiology reports used as text prompts or through full image segmentation. While text-based prompting is flexible, it provides limited spatial control over the location, shape, and boundary of abnormalities. In contrast, segmentation-based methods receive precise spatial guidance but are restrictive in requiring full-organ annotations. In this work, we propose a flexible multimodal framework for controllable volumetric image generation that supports input from radiology reports and segmentation prompts (both optional). Our approach allows users to provide segmentation of a specific anatomy or abnormality without requiring full-organ annotations. The semantic meaning of the segmentation mask is specified through an accompanying text description, resulting in a highly flexible and scalable conditioning mechanism. We develop a memory-efficient architecture based on a modified diffusion transformer that jointly processes image and segmentation tokens. The model further incorporates gated attention to effectively attend to long radiology reports. Experiments demonstrate that our method achieves state-of-the-art perceptual and semantic scores (e.g., 24% relative improvement in mean FID), generates high-resolution anatomically consistent CT volumes, and improves data efficiency when used for data augmentation. Radiologists’ evaluation further confirms strong alignment between generated and real medical images.
[CV-203] Boundary-Protection W8A8 HiFloat8 Quantization for Large-Scale Text-to-Video Diffusion Transformers ICME2026
链接: https://arxiv.org/abs/2606.00957
作者: Yiming Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 5 figures. Accepted to ICME 2026 Grand Challenge
Abstract:We present a post-training quantization (PTQ) approach for Wan2.1-T2V-14B, a 14-billion-parameter text-to-video diffusion transformer, targeting the W8A8 HiFloat8 (HiF8) format on Ascend 910B NPUs. A central challenge in quantizing video DiT models is the heterogeneous activation distribution across transformer blocks: boundary blocks (the first and last few blocks) exhibit fundamentally different statistical properties from middle blocks, making uniform quantization ineffective. We conduct a systematic per-block activation analysis across all 40 WanAttentionBlocks and use the findings to motivate a boundary-protection strategy that retains the first two and last three blocks in BF16 while quantizing the remaining 35 blocks with W8A8 HiF8. The proposed PTQ method matches or marginally exceeds the BF16 baseline on all five VBench dimensions evaluated, indicating no measurable accuracy loss within the 5-prompt evaluation set. An ablation study over four protection configurations confirms that full boundary protection yields the highest average VBench score, validating the data-driven block selection. We additionally investigate quantization-aware training (QAT) as a complementary fine-tuning stage and analyze the conditions under which it fails to outperform plain PTQ on single-card hardware.
[CV-204] COLLAR: Cascaded Object-Level Latent Refinement for High-Fidelity Conditional Generation
链接: https://arxiv.org/abs/2606.00954
作者: Xinlong Zhang,Jia Wei,Xiaoyu Zhang,Teng Zhou,Chengyu Lin,Yongchuan Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Achieving high-fidelity object-level control in Diffusion Transformers remains a significant challenge despite the introduction of structural priors like depth and Canny maps. Current object-level conditional generation methods frequently suffer from visual artifacts and struggle to maintain precise control over objects within small localized regions. To address these limitations, we propose Cascaded Object-Level Latent Refinement (COLLAR), a training-free framework that progressively optimizes object-level features via the Field-of-View (FoV) expansion. First, we propose the Cross-Scale Semantic Alignment (CSSA) module to address spatial-semantic gaps by injecting object-level features into extended-FoV branches via attention mechanisms. To further optimize these features, the Cyclic Feature Injection (CFI) module introduces a reciprocal background feedback mechanism. It leverages a frequency-based adaptive strategy to selectively update the global backbone with context-aligned local information. Finally, the extended-FoV branch serves as a hub for feature optimization, ensuring that object-level features are integrated into the global generation process without compromising final image quality. Extensive experiments on the COCO-MIG and COCO-POS benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods across semantic alignment, image quality, and spatial fidelity.
[CV-205] One Channel to Rule Them All: Rethinking Input Representation for Visual Place Recognition
链接: https://arxiv.org/abs/2606.00936
作者: Timur Ismagilov,Shakaiba Majeed,Michael Milford,Tan Viet Tuyen Nguyen,Sarvapali D. Ramchurn,Shoaib Ehsan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:Visual Place Recognition (VPR) is fundamental to long-term robot localization and SLAM, yet current systems overwhelmingly rely on RGB input, implicitly assuming color is necessary for global place recognition. We challenge this assumption, investigating the role of chromatic information across training regimes, model architectures and standard benchmarks under real-world appearance variation. We find that grayscale matches RGB performance generally and outperforms it under severe appearance shifts where color invariance is insufficiently learned, while color provides meaningful gains only where persistent and discriminative chromatic cues are present. Across selected benchmarks, a fully gray-trained MixVPR model achieves an average 82.4% Recall@1 compared to 81.2% for its RGB counterpart. In some cases, lightweight grayscale variants with 60% fewer parameters can outperform heavier RGB models. Grayscale further offers practical advantages in storage, bandwidth and alignment with resource-constrained systems. We conclude that for global VPR where scenes vary across illumination, weather, season and setting, color contributes minimally, and grayscale alone is sufficient for reliable place recognition.
[CV-206] CV-Arena: An Open Benchmark for Instructional Computer Vision Problem Solving with Human-AI Collaborative Preferences
链接: https://arxiv.org/abs/2606.00931
作者: Fangzhou Lin,Peiran Li,Lingyu Xu,Wenjing Chen,Qianwen Ge,Shuo Xing,Mingyang Wu,Xiangbo Gao,Siyuan Yang,Kazunori Yamada,Ziming Zhang,Haichong Zhang,Zhen Dong,Ming-Hsuan Yang,Zhengzhong Tu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 26 pages, 7 figures, 11 tables
Abstract:Instruction-guided image editing is becoming a general interface for visual work, yet existing benchmarks still focus largely on narrow appearance edits and do not fully capture the diversity of real-image tasks in professional workflows. Here, we define instructional computer vision problem solving as a broader formulation of image editing: given a real input image and a natural-language instruction, a system must produce an edited output that realizes the requested transformation while satisfying explicit preservation, geometric, physical, and usability constraints. We introduce CV-Arena, an open benchmark designed to evaluate this capability at professional scales. CV-Arena contains 12K high-resolution real-image instruction pairs spanning 16 instruction-based visual task types, constructed using CogRetriever, a dual-track retrieval-and-curation pipeline that combines targeted web search, agentic query refinement, verification, and traceability. To evaluate models at scale while preserving human fidelity, we propose Active Elo, a human-AI collaborative preference protocol that leverages CV-Judge, a logic-gated, multi-dimensional VLM evaluator, to reject clear failures and resolve high-confidence comparisons; and to route close, high-quality comparisons to expert raters. Mixed human and AI supervision is then aggregated through reliability-weighted Elo updates. Our comprehensive evaluation of 21 systems, including proprietary, open-source, and agentic models, on CV-Arena reveals persistent gaps in instruction adherence, physical reasoning, structural control, and fine-grained detail preservation. We further develop CV-Agent, a lightweight agentic model that combines planning, editing, and verification, and demonstrate that closed-loop reasoning is a promising direction for professional-grade instruction-following visual editing.
[CV-207] Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models
链接: https://arxiv.org/abs/2606.00928
作者: Sakib Mohammad,Jarin Ritu,Md Sakhawat Hossain
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 6 pages, 3 figures
Abstract:Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 (plus or minus 1.44), a 13.05-point improvement over the no-KD baseline (65.31 plus or minus 1.35) and 87.9% recovery of teacher oracle performance (89.12 plus or minus 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.
[CV-208] Bridging Topology and Deep Representation Learning: A TDA-ViT Fusion Model for Four-Class Brain Tumor Classification
链接: https://arxiv.org/abs/2606.00927
作者: Faisal Ahmed
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 4 figures
Abstract:Accurate brain tumor classification from magnetic resonance imaging (MRI) is a key requirement for early diagnosis and clinical decision-making. Vision Transformers (ViTs) have shown strong performance in medical image analysis by learning global contextual representations, but they often fail to capture intrinsic structural and topological patterns present in tumor regions. To address this limitation, we propose a fusion framework that combines Topological Data Analysis (TDA) features with pretrained Vision Transformer representations for four-class brain tumor classification. In the proposed method, TDA is used to extract complementary topological descriptors that capture geometric structure, connectivity, and shape information from MRI images. In parallel, a pretrained ViT model learns high-level semantic representations from the same images. These two feature spaces are then fused to form a unified and more discriminative representation for classification. The model is evaluated on the BRISC2025 dataset, which contains four brain tumor classes: glioma, meningioma, pituitary tumor, and non-tumor cases. Experimental results show that combining topological and transformer-based features significantly improves performance compared to using either approach alone. The proposed TDA-ViT fusion model achieves an accuracy of 99.10%, precision of 99.27%, recall of 99.15%, F1-score of 99.21%, and an AUC of 99.98%. It also outperforms several state-of-the-art models, including ResNet50, ResNet101, EfficientNetB2, and standalone Vision Transformers. These results demonstrate that topological features provide valuable complementary information that enhances deep representation learning, leading to a robust and highly accurate framework for automated brain tumor classification. Comments: 21 pages, 4 figures Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.00927 [cs.CV] (or arXiv:2606.00927v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.00927 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-209] Reason Retrieve Re-rank: A Zero-Shot Reasoning -Aware Framework for Composed Video Retrieval
链接: https://arxiv.org/abs/2606.00910
作者: Ali Alavi
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the \emphReason-Aware CoVR (CoVR-R) challenge at the CVPR~2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present \textbfR3-CoVR (\emphReason, Retrieve, Re-rank), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the \emphafter-effects an edit implies – state transitions, action phases, scene, camera and tempo – and verbalises a concise post-edit description; a contrastive video–text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains \textbf91.9% R@1 and \textbf98.2% R@10. Two findings drive these results: (i)~matching the description length to the contrastive encoder’s text window lifts \Rk1 from 67.5 to 72.7 ; and (ii)~the constraint-aware re-ranker, which reorders only the shortlist, lifts \Rk1 from 72.7 to 91.9 – the single largest gain. We analyse the re-ranker’s behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.
[CV-210] hZACH-ViT: Curved Latent Geometry for Compact Vision Transformers in Low-Data Medical Imaging
链接: https://arxiv.org/abs/2606.00906
作者: Athanasios Angelakis
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 2 figures, 4 tables. Code, execution notebooks, and aggregated result summaries will be released at this https URL upon publication
Abstract:Compact Vision Transformers are attractive for medical imaging in low-data and resource-constrained settings, but most existing variants assume that Euclidean latent geometry is sufficient for organizing image representations. We introduce hZACH-ViT, a family of curved-geometry extensions of ZACH-ViT, a compact zero-token Vision Transformer that removes positional embeddings and the class token and relies on global average pooling over patch representations. To isolate the role of geometry, we preserve the verified ZACH-ViT backbone and modify only the final representation space and prototype-based classifier head, enabling a controlled comparison between Euclidean, hyperbolic, and spherical latent geometries. We evaluate Poincaré, Klein, and spherical hZACH-ViT heads on seven MedMNIST datasets under an identical few-shot protocol with 50 samples per class and five random seeds. The completed benchmark contains 770 training runs spanning seven datasets, three non-Euclidean geometries, seven curvature magnitudes, and a Euclidean baseline. Across all seven datasets, the best non-Euclidean hZACH-ViT configuration improves over Euclidean ZACH-ViT, with an average gain of +0.021 in the dataset-specific primary metric and the largest improvement on OCTMNIST (+0.055 MacroF1). Fixed low-curvature configurations retain positive gains on the majority of datasets, and low curvature values (c = 0.1 or 0.2) account for six of the seven dataset-level winners. Rather than identifying a universally optimal manifold, our results establish geometry and curvature as dataset-dependent model-selection variables, with fixed low-curvature analyses confirming that gains persist beyond exhaustive per-dataset tuning. Comments: 17 pages, 2 figures, 4 tables. Code, execution notebooks, and aggregated result summaries will be released at this https URL upon publication Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.00906 [cs.CV] (or arXiv:2606.00906v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.00906 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-211] MMDG-Bench: A Benchmark for Multimodal Domain Generalization
链接: https://arxiv.org/abs/2606.00891
作者: Qianshan Zhan,Qian Wang,Da Li,Xiao-Jun Zeng,Xiatian Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal Domain Generalization (MMDG) seeks to leverage complementary modalities to enhance model robustness on unseen domains. Despite extensive progress in Multi-modal Learning (MML) and Domain Generalization (DG) as individual fields, their systematic integration remains under-explored. Current MMDG research is largely confined to action recognition and lacks standardized evaluation protocols. To address this, we introduce MMDG-Bench, a comprehensive benchmark featuring two foundational frameworks: DG then MML (D2M) and MML then DG (M2D). We provide unified experimental protocols across diverse tasks, including video-audio-flow action recognition and RGB-Depth-IR face anti-spoofing. By instantiating ten MMDG baselines through pairing a unified MML configuration with five DG techniques under both D2M and M2D orderings, we demonstrate that these structured combinations frequently outperform existing state-of-the-art methods, underscoring the necessity of a unified benchmarking effort. Our analysis yields three key insights: (1) Integrating DG techniques provides consistent generalization gains across various backbones, whereas non-DG methods are highly sensitive to backbone shifts; (2) The optimal framework choice depends on inter-modal stability: D2M excels when modal relations are stable across domains, while M2D is more robust to cross-domain relational variance; (3) Stronger backbones yield amplified performance dividends when integrated into our structured frameworks. MMDG-Bench provides a principled foundation and actionable design guidelines for future research in multi-modal robustness. Code is released at this https URL.
[CV-212] Cohort-Scale Neural Atlases of Ultrasound Video
链接: https://arxiv.org/abs/2606.00890
作者: Zhuorui Zhang,Roger Pallarès-López,Xuan Wu,Praneeth Namburi,Brian W. Anthony
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ultrasound is the most widely used real-time imaging modality in clinical practice, yet per-frame video annotation remains a major bottleneck: expert labels are scarce and costly, and image appearance varies with speckle, shadowing, attenuation, and operator-dependent probe pose. This is especially limiting because clinically relevant information is often dynamic, from left-ventricular motion in echocardiography to muscle and bone kinematics in musculoskeletal imaging. Population atlases can amortize annotation cost by registering observations to a shared canonical coordinate system, but existing neural atlas methods mainly target single videos, small test-time image sets, or object-centric image collections. We introduce a cohort-scale neural atlas for ultrasound video: a single canonical chart with per-video Generative Latent Optimization embeddings, trained jointly over thousands of frames in DINOv3 feature space. Across five cardiac and musculoskeletal datasets with point landmarks and segmentation masks, our method learns coherent canonical templates and enables accurate atlas-space annotation transfer. On EchoNet-Dynamic and MSK-Bone, it supports single- and few-shot transfer with accuracy competitive with strong dense-correspondence baselines, while training in minutes on a single consumer GPU. The learned embeddings are interpretable: linear projections reveal structured cohort variation, image-decoder interpolation produces anatomically plausible intermediate frames, and test-time latent inversion reconstructs held-out frames through the atlas. These results suggest that cohort-scale neural atlases offer a practical, interpretable representation for reducing expert annotation burden in ultrasound video analysis.
[CV-213] GABI: Geometry-Aware Boundary Integration for Spacecraft Segmentation CVPR2026
链接: https://arxiv.org/abs/2606.00886
作者: Iason Georgios Velentzas,Dhruv Ahuja,Panagiotis Tsiotras
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to AI4Space at CVPR 2026
Abstract:Accurate segmentation is crucial for autonomous spacecraft, as it directly affects downstream tasks related to 3D situational awareness. The harsh illumination conditions of space, however, produce images with high variability in appearance, hindering the generalization of segmentation approaches across different spacecraft and environments. In this work, we propose GABI, a lightweight boundary-aware multi-task segmentation architecture that augments a convolutional backbone with an auxiliary distance-field prediction head. The distance field provides dense geometric supervision around object boundaries, encouraging the network to learn spatially consistent representations of spacecraft structures while maintaining low model complexity suitable for onboard perception systems. We evaluated GABI against both an established convolutional baseline and a heavier transformer-based architecture. On the SPARK benchmark, distance-field supervision improves the baseline by up to 5% in Average Precision while achieving performance comparable to the transformer models. In generalization experiments, GABI improves Average Precision by more than 50% over the baseline. In cross-domain evaluation, the lightweight GABI variant performs within 5% in IoU and F1-score of the heavier transformer model while being approximately ten times smaller. At the same time, the heavier GABI variant surpasses the transformer architectures while remaining nearly three times lighter.
[CV-214] Images as Tables: In-Context Learning with TabPFN for Low-Data Detection of AI-Generated Images ICML2026
链接: https://arxiv.org/abs/2606.00872
作者: Jan Philip Walter,Shashank Agnihotri,Margret Keuper
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as a Spotlight Oral at the ICML 2026 Workshop Foundation Models for Structured Data. *Equal Contribution
Abstract:AI-generated image detection is a moving-target problem: detectors trained on one generator often fail when a new generator appears, and only a few labeled examples are available. We study a simple image-to-table formulation for this regime, where each image is encoded by a frozen DINOv3 backbone, its CLS feature is reduced to a 500-dimensional structured row with PCA, and TabPFN performs real/fake classification by in-context tabular inference rather than task-specific classifier training. This turns fake-image detection into low-data structured prediction over learned visual features, making detector adaptation depend on the labeled context set instead of gradient-based fine-tuning. On GenImage, LATTE, a recent state-of-the-art detector, remains stronger when many labeled samples from all generators are available, by 7.4% in the largest pooled setting, but DINOv3-PCA-TabPFN is stronger in the practically important low-data regime, outperforming LATTE by up to 8.2%, and in transfer settings where the detector must generalize from one generator to another. These results position tabular foundation models as a strong complementary adaptation mechanism for image forensics, shifting adaptation from detector retraining to lightweight in-context updates with a small labeled set of examples. Code URL: this https URL
[CV-215] Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated ICML2026
链接: https://arxiv.org/abs/2606.00871
作者: Rashid Mushkani
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Vision-language models (VLMs) are increasingly used to generate structured descriptions of street-level imagery for tasks such as streetscape auditing, mapping, and public consultation. These uses combine observable attributes with appraisal categories, and the human targets are often distributions of judgments with disagreement and explicit non-response. This paper argues that benchmarking VLMs for urban perception should treat disagreement and abstention as measurement outcomes, report inter-annotator reliability alongside model alignment, and treat the label space and scoring policy as negotiable artifacts when outputs are intended to inform urban governance. We ground the argument in a benchmark of 100 Montreal street scenes annotated along 30 dimensions by 12 participants from seven community organizations, and in a deterministic zero-shot evaluation of seven VLMs. Across dimensions, model agreement with human consensus co-varies with dimension-level human reliability, and for the appraisal dimension Overall Impression models and annotators exhibit distributional mismatch including different rates of Not applicable. We close with actions for benchmark creators, model developers, and institutions to make uncertainty and benchmark assumptions visible in evaluation reports.
[CV-216] RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection
链接: https://arxiv.org/abs/2606.00852
作者: Vinay Edula,Nilesh Badwe,Priyanka Bagade
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector’s task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.
[CV-217] MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts
链接: https://arxiv.org/abs/2606.00844
作者: Vinay Edula,Priyanka Bagade
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluated proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.
[CV-218] he Right Inference Strategy Is All You Need: Nearly Training-Free Domain-Wise Inference for EgoCross Challenge
链接: https://arxiv.org/abs/2606.00829
作者: Leyi Wu,Yifan Zhao,Jinjie Zhang,Yinchuan Li,Ying-Cong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:EgoCross evaluates multimodal large language models on egocentric video question answering under substantial domain shift, where test videos come from surgery, industrial assembly, extreme sports, and animal-mounted cameras rather than ordinary daily-life scenes. In the source-limited track, the base model is fixed to Qwen3-VL-4B, while the official task-specific support set contains only 20 training samples. This setting makes the challenge less about model scaling and more about exposing the right visual, temporal, and answer-selection cues to a constrained model. Our key observation is that the frozen baseline model is not simply incapable of these rare scenarios; rather, it often fails to transfer its existing visual-language knowledge to the new task format without an appropriate interface. We therefore use a domain-wise inference strategy that treats the four target domains separately and designs different input, prompting, and answer-mapping procedures according to each domain’s task characteristics. These strategies make the rare egocentric scenes more interpretable to the VLM by emphasizing the cues that matter for each domain. The resulting system is nearly training-free: surgery, and animal questions are answered with the base Qwen3-VL-4B model, while XSports and industry use only the official SFT checkpoint trained for two epochs on the provided 20 training samples. On the final evaluation, this simple strategy reaches 66.98% overall accuracy, suggesting that careful domain-aware inference can compensate for limited base-model strength and recover much of the ability already present in the baseline model.
[CV-219] RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
链接: https://arxiv.org/abs/2606.00828
作者: Leyi Wu,Yifan Zhao,Jinjie Zhang,Suzeyu Chen,Wosong Chen,Zhifei Chen,Tianshuo Xu,Qingchun He,Hongxin Hu,Haojian Huang,Yangkai Wei,Wenqian Li,Yinchuan Li,Ying-Cong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs) have shown strong visual understanding and are increasingly deployed in embodied AI systems, where reliable perception under real conditions is essential. However, existing benchmarks assess VLMs using clean images or isolated perturbations rather than stresses caused by physical scene formation. This design has two limitations: it covers only a narrow subset of everyday visual stresses, and some perturbations rarely appear in realistic embodied scenes. This gap raises a fundamental question: how can we define visual stress in a principled way that captures the diverse factors encountered in physical environments? To address this question, we formulate visual perception from an inverse graphics perspective and introduce RoboStressBench, a benchmark for evaluating VLM robustness to physical visual stress in embodied scenes. Inspired by the physical rendering equation, RoboStressBench decomposes visual stress into four physically grounded dimensions: Material (M), Viewpoint (V), Lighting (L), and Geometry (G). This design enables RoboStressBench to cover a broad range of visual stresses in real-world environments, while allowing controlled analysis of their effects on VLM capabilities such as visual recognition, reasoning, and planning. Through comprehensive evaluations of state-of-the-art VLMs, we identify stress-specific failure modes and reveal that different physical factors degrade different embodied capabilities, which are often obscured by aggregate accuracy. We further introduce a stress-aware agentic solver that detects visual stressors and invokes visual-editing skills before reasoning, improving robustness in high-stress scenarios. Overall, RoboStressBench provides a principled evaluation framework for diagnosing and improving VLM perception under real-world physical stress, supporting the development of more reliable embodied AI systems.
[CV-220] Directed Distance Fields for Constant-Time Ray Queries on Gaussian Splatting
链接: https://arxiv.org/abs/2606.00817
作者: Subhankar MIshra
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D Gaussian Splatting (3DGS) renders new views of a scene in real time. Like every rasterizer, it answers only primary rays, the rays from the camera through the image. It cannot trace the secondary rays that shadows, ambient occlusion, and global illumination need. We turn a trained 3DGS scene into a ray oracle by distilling a Directed Distance Function (DDF). The DDF is a small neural field. It takes a ray, given by an origin and a direction, and returns the distance to the first surface and whether the ray hits anything. Each query is one forward pass. The field is 52~MB, and its size does not depend on the number of Gaussians, so its cost and memory stay flat as the scene grows. We make three points. First, we study what supervision a DDF needs. Depth rendered from the Gaussians is too blurry to teach thin parts, while clean distance supervision recovers them. Second, we measure speed. The DDF is 26 to 72 times faster than sphere tracing an equivalent signed distance field, and unlike a bounding volume hierarchy built over the Gaussians, even on dedicated RT-core hardware, its query time and memory do not grow with the scene. Third, we show a pipeline that needs no mesh: images give a 3DGS scene, a neural surface gives clean distances, and the DDF learns from them. We use the DDF as a secondary-ray oracle for global illumination. It reproduces reference ray-traced shadows at 30.3~dB and ambient occlusion at 21.3~dB across 142 objects, and on real captured scenes. Our codes are available at this https URL.
[CV-221] DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models
链接: https://arxiv.org/abs/2606.00798
作者: Abdullah Al Shafi,Kazi Saeed Alam,Sk Imran Hossain,Engelbert Mephu Nguifo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 14 pages, 7 figures, 4 tables; appendix with additional ablations and qualitative results
Abstract:Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher’s converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.
[CV-222] MBench: A Comprehensive Benchmark on Memory Capability for Video World Models
链接: https://arxiv.org/abs/2606.00793
作者: Shengjun Zhang,Zhang Zhang,Simin Huang,Zhenyu Tang,Hanyang Wang,Chensheng Dai,Min Chen,Yifan Li,Yuxin Li,Yingjie Chen,Hao Liu,Chen Li,Yueqi Duan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Recent advancements in video-based world models have demonstrated an unprecedented ability to synthesize high-fidelity visual sequences. However, a fundamental gap persists between visually plausible video generation and the functional requirements of a world model, particularly in maintaining a stable and reasonable internal state over extended temporal horizons. While existing benchmarks primarily emphasize visual quality, motion coherence, and text-video alignment, they largely overlook memory, the core capability of a world model to preserve consistency across long-term horizons and complex interactions. To address this gap, we present \textbfMBench, a comprehensive benchmark dedicated to quantifying and evaluating the memory capability of video world models. We systematically decompose the memory capability of video world models into three hierarchical and complementary core dimensions: entity consistency, environment consistency, and causal consistency, which are further refined into 12 quantifiable sub-dimensions for comprehensive characterization of long-term memory. Our benchmark is built upon rigorously curated real-captured long videos, and evaluated by rule-based quantitative matrices and VLM to enable objective and comprehensive consistency assessment. Extensive evaluations of mainstream state-of-the-art video world models reveal critical systemic limitations of existing methods in long-term state retention, providing a standardized benchmark and clear research direction to advance the field.
[CV-223] DINO-GFSA: Geo-Localization via Semantic Gated Fusion and Mamba-based Sequential Aggregation
链接: https://arxiv.org/abs/2606.00784
作者: Beier Hu,Yuanshen Guo,Jialu Cai,Chengwei Li,Yong Wang,Shunan Wu,Zhigang Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cross-view geo-localization (CVGL) is critical for Unmanned Aerial Vehicle (UAV) self-positioning and target localization in GNSS-denied environments. However, acquiring robust semantics while preserving finegrained spatial details remains challenging. To address this, we propose DINO-GFSA, a framework leveraging a LoRA (Low-Rank Adaptation) adapted DINOv3 (ViTL) backbone for parameter-efficient, high-capacity representation. Crucially, we introduce a Semantic Gated Residual Fusion module, which utilizes high-level semantics to selectively calibrate and integrate low-level spatial cues, effectively bridging the semantic gap. Furthermore, a Mamba-based Sequential Aggregation Head is designed to capture long-range spatial dependencies with linear complexity. Experiments demonstrate state-of-the-art performance on University-1652 and DenseUAV benchmarks, notably surpassing the previous best on DenseUAV by 3.48% on Recall@1. These results validate DINO-GFSA as a generalized, robust solution for UAV CVGL.
[CV-224] FlowOVD: Learning Generative Latent Flows for Zero-shot Open-vocabulary Detection
链接: https://arxiv.org/abs/2606.00782
作者: Yao Wei,Andrea Cavallaro,Changjae Oh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Open-vocabulary object detection (OVD) has achieved remarkable progress through large-scale vision-language pre-training. Existing methods, however, typically formulate OVD as a discriminative prediction problem, where decoder queries are either static or initialized from encoder features, thus limiting their diversity and flexibility. In this paper, we introduce a generative perspective by modeling decoder query generation as a continuous transport process in latent space. We propose FlowOVD, a text-conditioned query generation framework based on rectified flow that progressively transforms text-agnostic queries into text-guided queries. By introducing continuous latent query dynamics into a vision-language model (VLM) based detector, our method avoids heuristic discrete query construction and enables more expressive semantic alignment for open-vocabulary detection. Without requiring additional training data, FlowOVD achieves 49.5 AP on COCO and 31.5 AP on LVIS, outperforming GroundingDINO by +1.2 AP (+2.5 %) and +4.1 AP (+15.0 %), respectively. The larger gain on the challenging long-tailed LVIS benchmark further highlights the effectiveness of continuous query generation for open-vocabulary generalization.
[CV-225] GIRL-DETR: Gradient-Isolated Reinforcement Learning for Video Moment Retrieval
链接: https://arxiv.org/abs/2606.00775
作者: Shihang Zhang,Mingjin Kuai,Ye Wei,Zhen Zhang,Wei Ji
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 6 figures. Submitted to IEEE Transactions on Image Processing (TIP). Code is available at: this https URL
Abstract:Video Moment Retrieval (VMR) task requires accurately localizing temporal boundaries aligned with natural language queries, but many models suffer from a misalignment between continuous surrogate losses and non-differentiable metrics, leading to optimization stagnation during the late stages of training and trapping boundary predictions in suboptimal solutions. Although Reinforcement Learning (RL) post-training successfully optimizes localization results for large models, applying it directly to lightweight networks easily disrupts the fragile feature representations established during the supervised phase. To overcome this optimization bottleneck, we propose Gradient-Isolated Reinforcement Learning for DETR (GIRL-DETR), introducing RL post-training into a lightweight temporal localization framework for the first time. The input video and text features first establish early alignment through Cross-Modal Interaction (CMI) before entering the transformer encoder. Subsequently, a Text-Guided Gating (TGG) mechanism dynamically injects semantic priors into the queries before the transformer decoder generates candidate proposals, providing high signal-to-noise ratio inputs for temporal prediction. After the supervised training reaches convergence, the backbone network is frozen to protect the feature manifold, while the detection head directly optimizes the non-differentiable evaluation metric tIoU to enhance localization accuracy through a Three-stage Progressive Reinforcement Learning (TPRL) strategy. This approach achieves an orthogonal decoupling of state representation and metric optimization. Experiments on Charades-STA, QVHighlights, and TACoS demonstrate that GIRL-DETR effectively resolves surrogate loss degradation and achieves substantial accuracy improvements with minimal parameter updates, providing a robust new pathway for RL applications in lightweight VMR models.
[CV-226] Head-Pose-Aware Visual Speech Recognition with FiLM Modulation
链接: https://arxiv.org/abs/2606.00751
作者: Matthew Kit Khinn Teng,Haibo Zhang,Takeshi Saitoh
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 27 pages, 4 figures
Abstract:Visual Speech Recognition (VSR) aims to recognize speech from visual cues such as lip movements, but its performance is fundamentally limited by viseme ambiguity and pose-induced variations that introduce geometric distortions and occlusions. Existing approaches mainly rely on linguistic context or implicit invariance, leaving visual representations insufficiently robust under non-frontal views. In this work, we propose a pose-aware phoneme-level framework, termed HP-VSR-ResFiLM, that explicitly incorporates head-pose information into visual feature extraction. The proposed framework adopts a two-stage pipeline consisting of a pose-conditioned visual encoder in Stage 1 and a pretrained NLLB language model in Stage 2 for phoneme-to-text reconstruction. Specifically, Stage 1 incorporates a pose-conditioned residual Feature-wise Linear Modulation (FiLM) block after the 2D CNN frontend to adaptively refine visual representations using head-pose information. Experiments on LRS2 and LRS3 demonstrate that HP-VSR-ResFiLM achieves competitive performance under comparable training conditions, attaining word error rates (WER) of 25.0% and 33.2%, respectively, without relying on additional training data. Ablation studies further show that a single residual FiLM block consistently improves overall WER, while deeper modulation at Layers 3 and 4 provides larger gains for samples with yaw angles greater than 30° without degrading performance for smaller pose variations. These findings demonstrate that explicit pose-aware feature modulation offers an effective and computationally efficient solution for improving VSR robustness in unconstrained settings.
[CV-227] SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy
链接: https://arxiv.org/abs/2606.00747
作者: Jie Gao,Jie Ma,Kaihui Lin,Kai Ye,Miaohui Zhang,Pingyang Dai,Liujuan Cao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety interface between human instructions and physical flight. In human-scale urban airspace below 20 meters, thin geometry, occlusions, vegetation, and urban clutter define whether an aerial agent can safely enter the space ahead. However, existing UAV datasets mainly provide 2D annotations or 3D boxes, while driving-oriented occupancy benchmarks assume stable ground-level sensor rigs. Both miss the defining regime of low-altitude flight: a front-facing monocular camera observing occupied and free space from a moving aerial body with frame-wise changing 6-DoF pose and camera extrinsics. To bridge this gap, we introduce \textbfSkyShield, to the best of our knowledge the first front-view monocular semantic occupancy benchmark for urban UAV flight below 20 meters. Built on CARLA, SkyShield contains 36K front-view UAV samples across diverse urban scenes and weather conditions, pairing each image with frame-wise 6-DoF UAV pose, frame-wise dynamic camera geometry, UAV states, and front-frustum semantic occupancy labels. We further propose \textbfKAR-mIoU, a UAV-centric and dynamics-aware metric that re-weights voxel-level evaluation by kinematic reachability and time-to-collision, revealing safety-critical risks hidden by conventional mIoU. To tackle this challenging new setting, we provide \textbfSkyOcc, a geometry-first monocular baseline that integrates frame-wise UAV attitude into projection, fuses temporal occupancy features, and applies safety-prior optimization to preserve sparse collision-critical structures. Together, SkyShield, KAR-mIoU, and SkyOcc establish occupancy as a safety interface for low-altitude aerial autonomy. Code and dataset will be released publicly.
[CV-228] Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders
链接: https://arxiv.org/abs/2606.00746
作者: Yitong Jiang,Hongjun Wang,Collin McCarthy,Hanrong Ye,David Wehr,Xinhao Li,Qi Dou,Tianfan Xue,Ka Chun Cheung,Simon See,Wonmin Byeon,Ke Chen,Kai Han,Jinwei Gu,Hongxu Yin,Pavlo Molchanov,Jan Kautz,Sifei Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation. C-GSPN makes the operator practical through three improvements: (1) a fast GSPN CUDA kernel that fuses per-step launches into a single warp-specialized implementation with shared-memory tiling, coalesced access, and a compact multi-channel propagation, reaching over 90% of peak memory bandwidth and running up to 40–52x faster than the original GSPN implementation; (2) a compressed latent-space propagation block with fused normalization, which turns kernel-level speed into block- and model-level efficiency; and (3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without the cost of from-scratch foundation-scale training. Distilled with 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by +2.1%, transfers to high resolution with a fraction of the data needed from scratch, and delivers a 4x end-to-end block speedup at 2K with single-pass, tiling-free inference. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2606.00746 [cs.CV] (or arXiv:2606.00746v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.00746 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-229] SORA: Free Second-Order Attacks in Fast Adversarial Training ICML2026
链接: https://arxiv.org/abs/2606.00738
作者: Mazdak Teymourian,Ramtin Moslemi,Farzan Rahmani,Mohammad Hossein Rohban
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICML 2026
Abstract:Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at this https URL.
[CV-230] CASTLE2026 Team WDL Technical Report
链接: https://arxiv.org/abs/2606.00712
作者: Zhengyang Li,Zhenglin Du,Yi Wen,Fang Liu,Shuo Li,Xu Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages
Abstract:The CASTLE Challenge @ EgoVis 2026 evaluates long-form egocentric video question answering over 600+ hours of multi-perspective recordings. Each four-choice question requires evidence from videos, transcripts, auxiliary photos, people, days, rooms, and temporal context. We propose an evidence-aware multimodal reasoning pipeline based on Qwen. Our system parses question hints, retrieves ASR chunks, attaches auxiliary images, samples candidate video frames, and routes questions into static visual, speech/text, temporal, and mixed types with specialized prompts. Multiple inference passes are aggregated by confidence-weighted voting and converted into the official Codabench format. In ablation, LoRA improves the score from 0.21 to 0.50, and more sampled frames further raise it to 0.58. Our final system ranks first in the CASTLE Challenge @ EgoVis 2026.
[CV-231] CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval
链接: https://arxiv.org/abs/2606.00706
作者: Md Aminur Hossain,Ayush V. Patel,Nitant Dube,Biplab Banerjee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages
Abstract:Cross-modal remote sensing image retrieval aims to retrieve semantically related scenes across heterogeneous sensing modalities. This remains challenging because paired observations may differ substantially in imaging physics, spatial resolution, spectral configuration, and visual appearance. Moreover, a single retrieval projection trained with one objective may be insufficient to jointly support cross-modal semantic alignment and same-modal neighbourhood preservation. We propose CR-JEPA, a Cross-modal Retrieval Joint-Embedding Predictive Architecture for dual-modality remote sensing retrieval. The model uses modality-specific stems, a shared transformer trunk, and JEPA-style predictive objectives to estimate masked latent target features within and across modalities. Inspired by LeJEPA, we apply Sketched Isotropic Gaussian Regularization to raw retrieval projections to stabilize embeddings and mitigate collapse. CR-JEPA further employs a decoupled-head design with a unified retrieval head for same-modal retrieval and a cross-modal retrieval head for cross-modal search. We evaluate CR-JEPA on BEN-14K, CBRSIR_VS, and DSRSID. On BEN-14K, CR-JEPA improves S1 to S2 retrieval from 61.23% to 75.82% and S2 to S1 retrieval from 63.73% to 75.40% over X-JEPA, while also achieving competitive same-modal retrieval with fewer parameters.
[CV-232] VICR: Visual In-Context Restoration for Real-World Image Super-Resolution
链接: https://arxiv.org/abs/2606.00704
作者: Qichang Zhang,Hailong Wang,Baiang Li,Linhao Wang,Rong Fu,Erkang Cheng,Simon James Fong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 28 pages, 11 figures, 9 tables
Abstract:Real-world image super-resolution (Real-ISR) requires balancing structural fidelity to degraded observations with realistic detail synthesis. However, existing generative Real-ISR methods often rely on entangled conditioning mechanisms, leading to structural drift or semantically inconsistent details. To address this issue, we propose Visual In-Context Restoration (VICR), a Diffusion Transformer (DiT)-based framework that formulates Real-ISR as image completion. Specifically, we introduce a decoupled visual prior injection mechanism that derives local and global cues from the low-quality (LQ) image: local cues help recover image structures and support high-frequency detail synthesis, while global cues guide overall generation and promote semantic consistency. For ambiguous regions under severe degradation, VICR employs an inference-time agent to refine semantic prompts using visual evidence from the LQ input while keeping model parameters fixed. Experiments show that VICR achieves state-of-the-art performance across multiple Real-ISR benchmarks with only 127M trainable parameters.
[CV-233] FROST-STA: Frozen Dense Features for the Ego4D Short-Term Object Interaction Anticipation
链接: https://arxiv.org/abs/2606.00694
作者: Chaoyang Wang,Lexuan Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Short-term anticipation in egocentric video requires more than recognizing the current scene: a system must infer which object the camera wearer will contact, which action will follow, and how soon the contact will happen. This report describes FROST-STA, our submission to the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. For each query time, the model produces a ranked set of structured hypotheses containing an active-object box, noun label, verb label, time-to-contact (TTC), and confidence. FROST-STA builds on the V-JEPA 2.1 STA evaluation protocol, but adapts it to the challenge by using object-centric decoding, multi-head prediction, and a submission-oriented training and ensembling recipe. We keep the V-JEPA 2.1 ViT-G backbone fixed and extract two dense token streams: video tokens from a short clip resized to 384 pixels before the query, and image tokens from the last observed high-resolution frame. A compact alignment module, consisting of an attentive probe and frame-guided temporal pooling, maps the clip representation onto the spatial reference of the final frame before fusing it with image features. The fused maps are decoded by Faster R-CNN-style STA heads that estimate box offsets, nouns, verbs, TTC values, and interaction quality. For the final leaderboard entry, we train for 25 epochs with the official training split plus additional permitted validation annotations, and combine predictions across eight heads and checkpoints from epochs 15-25. FROST-STA obtains 5.13 Overall Top-5 mAP on the official test server, ranking second in the challenge and showing that frozen dense image-video features can serve as a strong basis for object-level interaction forecasting.
[CV-234] Wavelet-Fusion Diffusion Model for Multimodal Brain MRI Synthesis with Modality and Metadata Conditioning
链接: https://arxiv.org/abs/2606.00689
作者: Muhammad Nabi Yasinzai,Remika Mito,Mangor Pedersen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 51 pages, 7 figures, including supplementary material. Submitted to Imaging Neuroscience
Abstract:Multimodal MRI provides complementary information for neuroimaging analysis, where different imaging modalities capture distinct anatomical, tissue, and pathological features that support the development and evaluation of downstream AI applications. Although large-scale structural MRI resources are increasingly available, their modality coverage is often uneven across public and pooled neuroimaging datasets. This uneven modality coverage is further complicated by heterogeneity across sites, scanners, and acquisition protocols, as well as demographic and clinical variables that are often sparse, inconsistently recorded, or unavailable across studies. Synthetic MRI generation can help address this imbalance by synthesizing target-modality volumes for dataset augmentation and controlled synthetic cohort creation. However, many existing MRI synthesis approaches are trained on narrow modality sets or relatively homogeneous cohorts, limiting their applicability to large pooled neuroimaging resources where modality availability, acquisition protocols, and metadata coverage vary substantially across datasets. Diffusion models have become an attractive approach for MRI synthesis because of their strong sample fidelity and diversity, but sampling directly in 3D voxel space is computationally expensive and slow at inference. Latent diffusion improves practicality by synthesizing MRI in a learned, 3D latent space, although generation quality depends on the autoencoder’s reconstruction fidelity and the resulting latent distribution. Our approach combines a Wavelet-Fusion variational autoencoder (WF-VAE) latent compressor with a conditional 3D U-Net diffusion model trained in the learned latent space using explicit modality and metadata conditioning. Our proposed Wavelet-Fusion Diffusion Model (WFDM) achieved the strongest distributional alignment among the evaluated synthetic MRI generators.
[CV-235] Shape-Prior-Based Point Cloud Completion for Single-Stage Fully Sparse 3D Object Detection
链接: https://arxiv.org/abs/2606.00688
作者: Kaizheng Wang,Mingqian Ji,Jian Yang,Shanshan Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-stage fully sparse 3D object detectors rely on point clouds data to detect objects in autonomous driving scenarios. However, the sparsity and incompleteness of point clouds significantly limit the performance of 3D object detection. To address this issue, this paper proposes a point clouds completion method specifically designed for single-stage fully sparse detectors. The entire shape-prior-based completion process consists of two consecutive steps. In the first step, we design a novel Instance Selection module, which is capable of identifying point clouds corresponding to foreground objects even when the baseline model does not generate proposals, while effectively ignoring the point clouds of background regions. In the second step, we introduce a novel Alignment-Based Point Completion module, which aligns the point clouds of foreground objects with prototypes in terms of both their centers and orientations. Subsequently, points are selected from the prototype to fill in the missing parts of the foreground object. We evaluated our method on two single-stage fully sparse detectors using the KITTI dataset. The experimental results demonstrate that the proposed method significantly improves the detection performance, confirming its effectiveness and generalizability.
[CV-236] A Modelling and Evaluation Framework for EuroCrops-Driven Sentinel-2 Crop Segmentation
链接: https://arxiv.org/abs/2606.00676
作者: Alexandra Nicoleta Scarlat,Ioana Cristina Plajer,Alexandra Baicoianu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This work presents a configurable pipeline for generating semantic-segmentation-ready agricultural datasets from Sentinel-2 imagery and EuroCrops parcel-level annotations. The workflow transforms heterogeneous vector crop annotations into aligned multispectral image–mask pairs through label harmonization, Sentinel-2 product selection, spatial alignment, rasterization, patch extraction, quality filtering, and class-aware sample selection. The generated dataset contains 67,337 patches from five European countries and uses a reduced taxonomy of ten crop classes plus background. A four-level U-Net with Group Normalization was trained using 10 Sentinel-2 spectral bands and a composite loss combining class-weighted cross-entropy and Dice loss. On the internal EuroCrops-based test split, the model achieved a mean Intersection over Union (mIoU) of 0.7665, a pixel accuracy of 0.8693, and a mean class accuracy of 0.9072. Compared with spectral and spatial-context Random Forest baselines, the U-Net showed the importance of learned multi-scale spatial representations for crop segmentation. External evaluation was performed on unseen Belgian EuroCrops subsets, DACIA5, and PASTIS. The results show a clear performance gap under external and cross-dataset evaluation, especially for benchmarks with different taxonomies, annotation protocols, spatial coverage, or temporal organization. The model transfers more reliably to dominant and taxonomically aligned classes such as maize and wheat, while performance remains limited for several minority classes and for the adapted single-date PASTIS setting. These findings highlight both the potential and the limitations of using EuroCrops-derived supervision for Sentinel-2 crop segmentation under realistic domain shifts. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.00676 [cs.CV] (or arXiv:2606.00676v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.00676 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-237] -CLIP: Enabling Thermal Perception for Contrastive Language-Image Pretraining
链接: https://arxiv.org/abs/2606.00673
作者: Tayeba Qazi,Ayush Maheshwari,Prerana Mukherjee,Brejesh Lall
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 34pages (including references and appendix), 13 figures
Abstract:Thermal imaging offers a powerful alternative to visible-spectrum vision under challenging conditions such as low illumination and adverse weather, yet foundational vision-language models like CLIP fail to align thermal images with textual descriptions due to a fundamental thermal perception gap. We identify three major challenges: the lack of captioned thermal datasets, the inability of standard LLMs to reason about thermal phenomena, and a key representational challenge in thermal imaging where global scene context and object-level heat signatures conflict when learned together in a single embedding space. To address these, we introduce IR-Cap, the first physics-aware thermal captioning pipeline and dataset providing complementary global and fine-grained thermal descriptions across three public benchmarks, and T-CLIP, a decoupled dual-LoRA framework that independently adapts CLIP for scene-level and object-level thermal understanding. T-CLIP achieves consistent improvements over all baselines across three thermal benchmarks in cross-modal retrieval, and we provide an exploratory demonstration of its applicability to text-conditioned thermal image generation.
[CV-238] SKIP: Sparse Keyframe Interpolation Paradigm for Efficient Embodied World Models
链接: https://arxiv.org/abs/2606.00664
作者: Ziheng He,Yixiang Chen,Ning Yang,Zhanqian Wu,Qisen Ma,Yuan Xu,Jiabing Yang,Peiyan Li,Xiangnan Wu,Xiaofeng Wang,Zheng Zhu,Jing Liu,Nianfeng Liu,Yan Huang
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, 10 figures
Abstract:Embodied world models have emerged as a promising paradigm in robotics by predicting how robot actions affect the surrounding scene. However, the rollout inference remains computationally expensive in pixel space, as long-horizon manipulation videos typically have to be generated frame by frame. This cost cannot be easily reduced by indiscriminately dropping frames, since downstream policies rely on complete preservation of sparse task-relevant events such as approach, contact, grasp, and release. To address this challenge, we propose Sparse Keyframe Interpolation Paradigm (SKIP), an event-preserving sparse-to-dense framework that avoids dense frame-by-frame generation. SKIP first identifies task-relevant keyframes by leveraging robot-aware multimodal features. It then synthesizes only these keyframes with a sparse video diffusion model. A learned gap predictor and an action-conditioned interpolator subsequently reconstruct the missing intervals according to the robot actions. On LIBERO, SKIP generates dense rollouts 4.16\times faster than a dense baseline while improving visual fidelity and reducing aggregate FVD by 89.0% . Importantly, SKIP-generated videos are effective policy-training data. Even when they fully replace real demonstrations, \pi_0.5 success drops only 1.3 pp in LIBERO simulation and 6.7 pp on the real robot, whereas fully dense frame-by-frame generation collapses by 48 to 58 pp.
[CV-239] AP-JEPA: Frozen Future-Latent Probing and Two-Stage Score Fusion for EPIC-KITCHENS-100 Action Anticipation ICIP CVPR
链接: https://arxiv.org/abs/2606.00662
作者: Chaoyang Wang,Lexuan Xu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The runner-up solution for the Action Anticipation Challenge, EPIC-KITCHENS-100 at the CVPR EgoVis Workshop 2026
Abstract:This report presents TAP-JEPA, our runner-up submission to the EPIC-KITCHENS-100 (EK-100) Action Anticipation Challenge at EgoVis 2026. The task is to anticipate the next verb, noun, and verb-noun action from an egocentric clip that ends before the target action begins. Instead of fine-tuning a large video backbone, TAP-JEPA builds a compact anticipation model on frozen V-JEPA 2.1 features: a ViT-G/384 encoder extracts visible pre-action tokens, the pre-trained latent predictor estimates near-future tokens from the observed context, and both token groups are fused by attentive probes with task-specific queries for verbs, nouns, and action pairs. For the final submission, we expand supervised training with the official training split and most of the validation split, reserving a small subset for sanity checks and qualitative inspection, and adopt a two-stage score fusion that first averages eight independently initialized probe replicas within each epoch and then merges candidates from epochs 12-20 with field-dependent weights. On the official open-testing leaderboard, our sunshinesky entry achieves 27.91 percent overall action Mean Top-5 Recall (MT5R), ranking second and only 0.04 percentage points behind the top score.
[CV-240] Collaborative Few-Step Distillation and Low-Bit Quantization for Wan2.2 Dual-Expert Video Diffusion Models
链接: https://arxiv.org/abs/2606.00658
作者: Jinyang Du,Shenghao Jin,Ziqian Xu,Ruihao Gong,Shiqiao Gu,Yang Yong,Jinyang Guo,Xianglong Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large video diffusion models achieve strong visual quality but remain expensive to deploy because each sample requires many denoising steps and a large resident parameter footprint. This paper studies a deployment-oriented compression pipeline for Wan2.2-T2V-A14B by combining few-step distribution-matching distillation with low-bit quantization. The pipeline follows the model’s dual-expert denoising route, calibrates the high-noise and low-noise branches separately, protects sensitive entrance layers, and uses HiF4-style low-bit representation to improve dynamic-range coverage. Quantization is calibrated on the distilled few-step student rather than on the original long-step trajectory, reducing activation-distribution mismatch during inference. The proposed co-design keeps the quantized model close to the same-step full-precision model and surpasses the original full-precision baseline at 8 and 20 steps on average. The 20-step setting gives the best quality-efficiency trade-off in the tested configurations.
[CV-241] An Attribute-Based Measure of Video Complexity
链接: https://arxiv.org/abs/2606.00640
作者: Aditya Sarkar,Yi Li,Zihao Wang,Jiacheng Cheng,Sai Vidyaranya Nuthalapati,Aashu Singh,Shlok Kumar Mishra,David Jacobs,Nuno Vasconcelos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A new framework for the estimation of the complexity posed by video-question pairs to video-LLMs, Video Attribute-Based Complexity (VideoABC), is proposed. Video complexity is defined as the probability of failure of a video-LLM for a given video-question pair. VideoABC is a non-parametric complexity measure, using a reference video dataset and a pre-defined vocabulary of video attributes informative of complexity, \eg the scene complexity or the speed of the video event informative of the question. In a training phase, reference videos are projected into the space of these attributes, which is then quantized. The expected ABC of each quantization cell is then computed. Given a new video and its projection into the attribute space, complexity is estimated by the expected ABC of the associated quantization cell. To enable the use of VideoABC with small reference video datasets, two quantizers are combined: a k-means quantizer that enables accurate complexity estimates for samples in the distribution of the reference dataset and a universal lattice quantizer that guarantees generalization to out-of-distribution samples. A synthetic video generation procedure, inspired by target-distractor manipulations of psychophysics studies, is proposed to populate the cells of the lattice quantizer during training, enabling the computation of their expected ABCs. Experimental results show that VideoABCis effective even with very low-dimensional attribute representations, substantially outperforming approaches like `video-LLM as judge’ with much less complexity. Finally, the explainable nature of the VideoABC score, in terms of well-defined attributes, is shown to provide insights on how the attribute composition of benchmarks affects their complexity.
[CV-242] A Systematic Benchmark of Intraoperative Ultrasound-to-MR Synthesis for Brain Tumour Surgery
链接: https://arxiv.org/abs/2606.00630
作者: Olga Esteban-Sinovas,Santiago Cepeda,Ignacio Arrese,Rosario Sarabia
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Intraoperative ultrasound (ioUS) is a versatile, cost-effective modality in brain tumour surgery, but its interpretation is difficult: acquisition planes are non-standard, artefacts are modality-specific, and its appearance differs markedly from the preoperative MRI on which surgical-planning tools, segmentation models and the surgeon’s experience rely. Synthesising MRI-like images from ioUS could let this MRI-based infrastructure be reused intraoperatively without an extra scan. Most prior work evaluates a single architecture in isolation; to our knowledge, no benchmark has spanned architectural paradigms, inference regimes and downstream-task endpoints under a common protocol. We address this gap on the public ReMIND data set (76 patients; 153 paired ioUS/T2w and 104 paired ioUS/FLAIR studies; 60/16 patient-level train/held-out split). Six generators (four GAN baselines: Pix2Pix, SwinPix2Pix, CycleGAN, CUT; the transformer-augmented ResViT; and the few-step diffusion model SynDiff) were each trained under four inference regimes (2D, 2.5D, 2D + 3D-refinement, full-3D) and two targets (T2w only; T2w + FLAIR multi-task), yielding 48 experiments. Image-fidelity metrics (SSIM, PSNR, MAE, LPIPS) were complemented by an nnU-Net v2 downstream segmentation evaluation (tumour and resection cavity) and by subgroup analyses by histological grade and reoperation. No architecture dominated every axis, and, critically, perceptual quality tracked downstream utility most closely (LPIPS, r=-0.66, p0.001), whereas higher SSIM was associated with worse utility (r=-0.64, p0.001); SynDiff-2.5D best preserved downstream segmentation (U_Dice=0.55). Perceptual and downstream-task metrics should therefore be reported alongside or in preference to global SSIM, and architecture choice conditioned on surgical phase, patient history and clinical objective.
[CV-243] MM-Snowball: Evaluating and Mitigating Hallucination Snowballing in Multimodal Multi-Turn Dialogue ICML2026
链接: https://arxiv.org/abs/2606.00622
作者: Yue Jiang,Xue Jiang,Lihua Zhang,Zhiqiang Wang,Yuhang Lu,Peng Wang,Bo Han,Feng Zheng,Dingkang Yang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by The International Conference on Machine Learning (ICML 2026)
Abstract:Multimodal large language models (MLLMs) demonstrate remarkable visual understanding, yet their reliability in interactive settings is severely undermined by hallucination snowballing: a phenomenon where initial errors amplify across conversational turns, leading to a collapse in coherence. This failure reveals a fundamental vulnerability where models progressively neglect visual grounding in favor of over-relying on polluted textual history. Existing benchmarks are predominantly confined to single-turn VQA, which fail to capture the complex dynamics of error propagation in long-horizon interactions. To address this, we introduce MM-Snowball, the first benchmark for fine-grained diagnosis of hallucination snowballing within dialogues. Extensive evaluation shows that our benchmark poses a significant challenge even to advanced MLLMs and reveals the inefficacy of existing mitigation methods designed for single-turn VQA. To counteract this degradation, we propose Conflict-Aware Visual Rectification (CAVR). This training-free method mitigates snowballing through a synergistic dual-mechanism that refreshes visual grounding at the representation level and rectifies output distributions at the logit level, effectively re-anchoring the model to visual facts. Experiments demonstrate that CAVR achieves state-of-the-art performance, offering a promising path toward more reliable interactive AI. Data and code are available at: this https URL
[CV-244] FlowNar: Scalable Streaming Narration for Long-Form Videos ICML2026
链接: https://arxiv.org/abs/2606.00620
作者: Zeyun Zhong,Manuel Martin,Chengzhi Wu,David Schneider,Frederik Diederichs,Juergen Gall,Juergen Beyerer
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICML 2026
Abstract:Recent Large Multimodal Models (LMMs), primarily designed for offline settings, are ill-suited for the dynamic requirements of streaming video. While recent online adaptations improve real-time processing, they still face critical scalability challenges, with resource demands typically growing at least linearly with video duration. To overcome this bottleneck, we propose FlowNar, a novel framework for scalable streaming video narration. The core of FlowNar is a dynamic context management strategy for historical visual context removal, combined with our CLAM (Cross Linear Attentive Memory) module for streaming visual history retention, ensuring bounded visual memory usage and computational complexity, crucial for efficient streaming. We also introduce a realistic self-conditioned evaluation protocol and complementary evaluation metrics to assess streaming narration models under deployment-like conditions. Experiments on the Ego4D, EgoExo4D, and EpicKitchens100 datasets demonstrate that FlowNar substantially improves narration quality over strong baselines while being highly efficient, supporting processing of 10 \times longer videos and achieving 3 \times higher throughput (FPS). The code is available at this https URL.
[CV-245] Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion
链接: https://arxiv.org/abs/2606.00616
作者: Shivam Singh,Saptarshi Majumdar,Pratik Prabhanjan,Zicheng Liu,Emad Barsoum
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy at 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.
[CV-246] FiSeR: Fine-Grained Source Representations for Cross-Domain AI Image Detection
链接: https://arxiv.org/abs/2606.00606
作者: Shan Zhang,Yongxin He,Mingming Zhang,Huiwen Tian,Lei Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world synthetic image detectors often generalize poorly under domain shift despite strong in-domain performance. Using unsupervised UMAP projections, we find that natural and synthetic features remain partially separable on unseen datasets, yet performance still drops, suggesting that the classification head overfits to training-domain artifacts. Therefore, the key is to learn more transferable representations so that the decision criterion is more stable and robust to domain shifts. Based on the structural fact that synthetic images are produced by diverse generators, we propose a hierarchical contrastive learning framework that improves the separability between natural and synthetic images while preserving generator identity information. It jointly optimizes (i) a coarse contrastive objective between natural and synthetic images and (ii) a fine contrastive objective among synthetic images using generator identities. Trained on WildFake, our method achieves an average AUROC gain of +10.22 on cross-domain evaluation over Chameleon, AIGIBench, Community Forensics, and GenImage under the same settings as the strong baseline DIRE. For few-shot adaptation, we freeze the backbone and fit an SVM head on 10 labeled samples per class, improving AUROC by +10.64 on AIGIBench and +17.41 on Chameleon, averaged over 12 widely used detectors. Our code is publicly available at: this https URL.
[CV-247] ASAP: Advancing Medical Volumetric Representation Learning with Anatomy-aware Semantically-adaptive Pre-training MICCAI2025
链接: https://arxiv.org/abs/2606.00602
作者: Rongsheng Wang,Fenghe Tang,Zihang Jiang,Yingtai Li,Xu Zhang,Haoran Lai,Wenxin Ma,Wei Wei,Zhiyang He,Xiaodong Tao,Rui Yan,Qingsong Yao,Shaohua Kevin Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI2025 extention
Abstract:Learning transferable and interpretable representations from medical volumetric scans remains challenging due to complex anatomical structures and weak, heterogeneous supervision provided by radiology reports. In this paper, we propose Anatomy-aware Semantically-Adaptive Pre-training (ASAP), a principled vision-language pre-training framework for fine-grained medical volumetric representation learning from large-scale chest CT scans and their corresponding radiology reports. ASAP integrates three key components: (1) an anatomy-aware knowledge injection module that incorporates organ-level structural priors via off-the-shelf segmentation tool to encourage anatomically coherent representations; (2) a semantically-adaptive selective alignment mechanism that dynamically associates sentence-level findings with localized volumetric regions; and (3) a semantically-adaptive fusion module for effective interaction between anatomically informed visual features and grounded textual cues under dual-modal masked modeling paradigm. Beyond methodological contributions, we establish a comprehensive benchmark for medical volumetric vision-language pre-training on chest CT, covering 15 datasets and 22 downstream tasks spanning abnormality classification, segmentation, disease prognosis prediction, report generation, vocabulary classification, cross-modal retrieval and visual question answering. This benchmark provides standardized evaluation protocols to systematically assess representation quality under diverse clinical settings and data regimes. Extensive experiments demonstrate that ASAP consistently achieves state-of-the-art performance across tasks and datasets, with particularly pronounced gains under limited supervision and distribution shift, validating its effectiveness in learning transferable and clinically meaningful volumetric representations.
[CV-248] hrough the PRISM: Principle-Aware Interpretable and Multi-Scale Evaluation of Visual Designs
链接: https://arxiv.org/abs/2606.00592
作者: Mona Gandhi,KJ Joseph,Srinivasan Parthasarathy,Sayan Nag
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern clarity and intent of the communicator. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision. To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement. Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.
[CV-249] Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting
链接: https://arxiv.org/abs/2606.00588
作者: Phuoc-Nguyen Bui,Van-Vi Vo,Duc-Tai Le,Van-Nguyen Pham,Ki-Young Kim,Seung-Young Yu,Hyunseung Choo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Long-term visual acuity (VA) outcomes after anti-VEGF therapy are central to patient counseling, expectation setting, and follow-up planning in diabetic macular edema (DME). However, in clinical practice, physicians must often estimate long-term visual trajectories based only on early post-treatment findings, making reliable prognostication difficult. Although prior OCT-based learning approaches have largely focused on short-term response or single-endpoint prediction, modeling VA trajectories across multiple future time points from early longitudinal observations remains insufficiently explored. In this study, we assembled a real-world cohort of 188 anti-VEGF-treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that integrates structural features from baseline and month-1 OCT with the tabular variables to capture baseline disease status and early treatment response. ReVA uses spatial attention to preserve localized prognostic imaging features and a dependency-aware tabular encoder to model interactions among clinical variables. These multimodal representations are fused to predict patient-specific long-term visual acuity trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management.
[CV-250] Improving Visual Representation Alignment Generation with GRPO
链接: https://arxiv.org/abs/2606.00583
作者: Shentong Mo,Sukmin Yun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:
Abstract:Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA’s static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.
[CV-251] On the Difficulty of Learning a Meta-network for Training Data Selection
链接: https://arxiv.org/abs/2606.00571
作者: Zilin Du,Junqi Zhao,Boyang Albert Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and lack of informative features that correlates with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.
[CV-252] DeepLatent: Think with Images via Parallel Latent Visual Reasoning
链接: https://arxiv.org/abs/2606.00562
作者: Dongchen Lu,Zhimo Li,Mao Shu,Huo Cao
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The emerging paradigm of “thinking with images” embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.
[CV-253] Improving Visual Grounding in Remote Sensing via Cluster-Guided Refinement and Model Ensemble Voting CVPR2026
链接: https://arxiv.org/abs/2606.00556
作者: Panav Shah,Geet Sethi,Ashutosh Gandhe
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CVPR 2026 Workshop MORSE
Abstract:Visual grounding aims to locate image regions that correspond to natural language descriptions and is a key component of interpretable vision systems. In remote sensing imagery, grounding is particularly challenging due to complex scenes, small objects, and large variations in scale. Relying on a single model is often insufficient to address these diverse challenges. In this work, we propose two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), that combine the complementary strengths of RemoteSAM, a visual grounding model specialized for remote sensing, and SAM3, a powerful general-purpose segmentation model. Our approach first uses RemoteSAM to obtain an initial estimate of object location, which is then refined using SAM3 to produce more accurate and spatially consistent segmentations. Additionally, we explore an ensemble strategy based on majority voting across six diverse grounding pipelines, each with distinct capabilities. This multi-model framework improves robustness and significantly enhances localization accuracy. Experimental results demonstrate that the proposed pipelines and ensemble approach outperform individual models, leading to more reliable and precise visual grounding predictions.
[CV-254] CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery CVPR
链接: https://arxiv.org/abs/2606.00548
作者: Oishee Bintey Hoque,Nibir Chandra Mandal,Mandy L Wilson,Samarth Swarup,Madhav Marathe,Abhijin Adiga
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at CVPR Workshop-2026. First two authors has equal contribution
Abstract:Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.
[CV-255] ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs
链接: https://arxiv.org/abs/2606.00543
作者: Yiling Gao,Hongchen Wei,Zhenzhong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces a variational information distillation, enabling the compact representation to preserve the essential information to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.
[CV-256] An Effective Solution for the CVPR 2026 8th UG2 Challenge Track 3: Dynamic Object Segmentation in Turbulence CVPR2026
链接: https://arxiv.org/abs/2606.00522
作者: Hongzhen Li,Miao Yu,Leilei Cao,Youwei Pan,Yingfang Zhu,Fengjie Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 1 figure, CVPR 2026 8th UG2+ Challenge Track 3
Abstract:In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost the segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data-centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model’s robustness against complex geometric distortions. Second, we introduce a spatio-temporal post-processing module. This refinement step effectively removes persistent boundary-connected false foregrounds and short-lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our proposed method ranks the 2st place in the challenge.
[CV-257] Generate in Reconstruction Space Match in Semantic Space: Transport Geometry for One-Step Generation
链接: https://arxiv.org/abs/2606.00514
作者: Hugues Van Assel,Edward De Brouwer,Saeed Saremi,Gabriele Scalia,Aviv Regev
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 4 figures
Abstract:Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fréchet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39 \times reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at this https URL.
[CV-258] Saliency-Aware Model Merging ICML2026
链接: https://arxiv.org/abs/2606.00511
作者: Jungin Park,Jiyoung Lee,Kwanghoon Sohn
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: ICML 2026 Camera-ready
Abstract:Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that performs cross-domain proficiency. Current data-free model merging methods often struggle to scale as they rely on simple parameter-level heuristics that ignore inter-layer dependencies and non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.
[CV-259] Structure-Aware Consistency Priors for Shape from Polarization in Complex Media
链接: https://arxiv.org/abs/2606.00509
作者: Kaimin Yu,Puyun Wang,Huayang He,Xianyu Wu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recovering surface normals from single view polarization images in complex media remains challenging. This paper focuses on ice as a representative complex medium, where intricate light matter interactions lead to a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of AoLP. Building on this, a dual-branch network (IceSfP) is designed to integrate raw polarization features with priors via cross modal attention and multi-scale feature fusion, enabling accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice SfP dataset is constructed. Experimental results show that the method outperforms existing approaches across all metrics, achieving a MAE of 16.01 deg, which is 2.74 deg lower than the second-best method. The framework provides a generalizable solution for high-precision geometric perception in complex media.
[CV-260] V-LynX: Token Interface Alignment for VideoX LLM s ICML2026
链接: https://arxiv.org/abs/2606.00508
作者: Jungin Park,Jiyoung Lee,Kwanghoon Sohn
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML 2026 Camera-ready
Abstract:This study introduces an intriguing phenomenon in Video LLMs: rather than merely translating frames into textual embeddings, Video LLMs establish a continuous manifold, token interface, allowing visual tokens to operate as standalone entities within the architecture. Exploiting this discovery, we propose V-LynX, a scalable framework that integrates novel modalities into Video LLMs by repurposing the internalized interface. Departing from conventional paradigms that necessitate heavy modality-specific encoders or paired supervision, V-LynX employs a lightweight auxiliary pathway in parallel with the frozen vision encoder. Our method integrates new sensory inputs with intrinsic video priors by aligning both attention responses and statistical distributions using unpaired unimodal data sets. This ensures manifold compatibility while preserving the integrity of the Video LLMs. Extensive benchmarks demonstrate that V-LynX achieves SOTA and efficiency across audio-visual QA, 3D reasoning, high-frame-rate, and multi-view video understanding. The code is available at this https URL.
[CV-261] OptiWorld: Optimal Control for Video World Generation under Physical Constraints
链接: https://arxiv.org/abs/2606.00499
作者: Yu Yuan,Jianhao Yuan,Xijun Wang,Daiqing Li,Liu He,Lu Ling,Stanley H. Chan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Porject Page: this https URL
Abstract:Video generation models are becoming a scalable form of world models, but they mainly generate plausible motion rather than proactively control or optimize the underlying dynamics. As a result, an object in the generated video may follow trajectories that are unsafe, not smooth, inefficient, or physically inconsistent. In this work, we propose \textbfOptiWorld, a framework that brings classical optimal control into video generation at inference time. OptiWorld first extracts a compact, task-relevant world state, then plans an optimal trajectory under physical constraints, and finally renders the video conditioned on this trajectory. We formulate planning as a geometric problem on a continuous manifold, which converts 3D geometry and task-dependent physical constraints into a unified planning geometry. By adding this optimal-control layer, OptiWorld generates videos with preferable dynamics, demonstrating strong potential in multiple tasks including goal-conditioned image-to-video generation, video dynamics editing, and counterfactual generation.
[CV-262] Pre-Deployment Robustness Stress Testing for CT Segmentation Systems Using Clinically Motivated Multi-Corruption Augmentation
链接: https://arxiv.org/abs/2606.00491
作者: CholMin Kang,Jonghyun Chung,Amanpreet Kaurb,Nagesh Gulkotwarb,Arthi Sivasankaranb
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning-based CT segmentation systems often achieve high accuracy on clean benchmark images, but their performance may degrade under heterogeneous clinical imaging conditions such as noise, resolution loss, contrast variation, intensity shift, and artifacts. This instability can limit reliable deployment in real-world medical imaging workflows. We propose Robustness via Augmented Multi-corruption Pipeline (RAMP), a robustness-oriented augmentation framework for CT segmentation. RAMP combines anatomically constrained spatial perturbations, CT intensity transformations, and stochastic multi-corruption composition to expose models to clinically plausible image degradation during training. Across two CT segmentation evaluation settings, RAMP achieved the strongest corrupted-image performance and the smallest clean-to-corrupted robustness gap. In the five-organ noisy evaluation benchmark, RAMP improved mean corrupted Dice from 0.610 to 0.753 and reduced the robustness gap from 0.264 to 0.064 compared with the nnU-Net baseline. In Abdomen1K, RAMP improved mean corrupted Dice from 0.633 to 0.789 and reduced the robustness gap from 0.290 to 0.070. Although RAMP did not achieve the highest clean-image Dice, it substantially mitigated worst-case segmentation collapse under severe image degradation. These results suggest that multi-corruption augmentation can serve as a practical pre-deployment strategy for improving the reliability of CT segmentation systems in heterogeneous clinical environments. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00491 [cs.CV] (or arXiv:2606.00491v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.00491 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Cholmin Kang [view email] [v1] Sat, 30 May 2026 02:42:38 UTC (6,426 KB)
[CV-263] 3D Segment Anything Model with Visual Mamba for Diagnosing Placenta Accreta Spectrum
链接: https://arxiv.org/abs/2606.00489
作者: Yuliang Zhang,Fang He,Lulu Peng,Tianyu Yan,Pingping Zhang,Ting Song,Lili Du,Dunjin Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IEEE Transactions on Image Processing (TIP2026). More modifications may be performed
Abstract:Placenta Accreta Spectrum (PAS) is a rare but highly dangerous obstetric disease. Early and accurate PAS diagnosis is critical for maternal health. Traditional PAS diagnosis relies on experienced doctors by analyzing the cesarean history and Magnetic Resonance Imaging (MRI) data. However, district-level hospitals often lack the expertise and resources for accurate PAS diagnosis. To address these challenges, we establish the first MRI-based PAS dataset, which includes both fine-grained segmentation and classification annotations. Meanwhile, diagnosing PAS can be significantly enhanced by segmenting lesion areas from MRI images of the uterus. To achieve automatic PAS diagnosis, we propose 3DSAMba, a novel feature learning framework for effective lesion segmentation. More specifically, we first design a 3D Segment Anything Model (SAM) and incorporate medical domain information into the model through an efficient adapter mechanism. In addition, we introduce a Multi-Level Aggregation Mamba (MLAM) to aggregate feature maps across different levels and a Fusion State Space Model (FSSM) to fuse multi-scale features from both the encoder and decoder. Finally, we apply segmentation masks to the original MRI images through element-wise multiplication, effectively isolating lesion areas for more accurate PAS diagnosis. Extensive experiments validate that our framework significantly improves the PAS diagnostic performance. To facilitate further research in PAS diagnosis, we have released the dataset and source code at this https URL.
[CV-264] MUSCLE-NET: Predicted-Multiscale-Aware Network for Pedestrian Trajectory Forecasting
链接: https://arxiv.org/abs/2606.00471
作者: Yu Liu,Ming Huang,Xiao Ren,Zhijie Liu,Youfu Li,He Kong
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This manuscript has been accepted to the IEEE Transactions on Intelligent Transportation Systems as a regular paper
Abstract:Accurate pedestrian trajectory prediction is essential for safe navigation in autonomous driving and intelligent transportation systems. Despite substantial progress made by recent methods, most existing approaches are limited in fully exploiting diverse observations and often overlook the scale dependency of future motion, treating multiscale features uniformly regardless of underlying motion dynamics. This limits their robustness across diverse pedestrian behaviors. To address these challenges, we propose a Predicted-MUltiSCale-Aware Network (MUSCLE-NET) for Pedestrian Trajectory Forecasting that integrates complementary multimodal cues with scale-adaptive prediction mechanisms. The proposed framework is built upon a Multiscale Multimodal Feature Extraction (MMFE) module, which combines multiscale representation, modality-aware recalibration, and directional cross-modal fusion to construct semantically aligned representations from bounding boxes, velocities, and pose information. Building on these features, a Multiscale Enhanced Hierarchical Prediction (MEHP) module performs prediction-aware future-motion refinement via a probabilistic coarse predictor, scale-aligned fusion, and progressive refinement, adaptively selecting scale-relevant cues to mitigate spatial drift. Extensive experiments on the JAAD and PIE benchmarks demonstrate that the proposed MUSCLE-Net achieves competitive performance and consistent gains compared with state-of-the-art trajectory prediction methods.
[CV-265] An explainable hierarchical self attention-based approach for tremor detection in the time domain ALT
链接: https://arxiv.org/abs/2606.00461
作者: Timothy Odonga,Jeanne M. Powell,Mark Saad,Richa Tripathi,Christine D. Esper,Stewart A. Factor,Hyeokhyen Kwon,J. Lucas Mckay
类目: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注: Submitted to PLOS Digital Health
Abstract:Tremor is a common movement disorder associated with conditions like Parkinson’s disease and Essential tremor, traditionally diagnosed through expert clinician assessment. Current automated detection methods rely on frequency-domain features informed by clinical expertise. In this work, we present an explainable, two-stage hierarchical framework for tremor detection in the time domain that learns tremor patterns directly from 3D kinematic marker time-series data across entire tremor-provoking trials. Our framework combined a deep convolutional and long short-term memory network to learn tremor representations from short, discrete, non-overlapping time segments of kinematic time series data from trials, which are then processed by a vision transformer that models their long-term temporal dynamics of time segment features for trial (session) level classification. Evaluated across nine body parts, the framework achieved F1-scores of 0.594 - 0.947 depending on body parts (average: 0.765), falling short of the frequency-domain state-of-the-art performance (0.909) while requiring minimal preprocessing. Attention weights and gradient-based class activation maps (Grad-CAM) identified time-domain features of tremor across body parts. This proof of concept demonstrated the feasibility of data-driven time-domain modeling for tremor detection across anatomically diverse body parts, while reducing reliance on expert-engineered spectral features and providing posthoc interpretability of temporal and anatomical patterns of tremor.
[CV-266] Beyond Static Gaussians: An Empirical Investigation of Architectural Paradigms for Dynamic 3D Scene Reconstruction
链接: https://arxiv.org/abs/2606.00452
作者: Adrian Ramlal,John S. Zelek
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)
Abstract:Dynamic scene reconstruction via 3D Gaussian Splatting (3DGS) has emerged as a compelling approach for representing evolving environments, yet understanding trade-offs between methodologies remains crucial. This paper presents a comprehensive analysis of dynamic 3DGS methods, categorizing them into two paradigms: structure-guided methods employing auxiliary representations (deformation fields, canonical spaces, grids) to model temporal changes, and gaussian-centric methods encoding dynamics directly into primitives via continuous functions or 4D representations. We evaluate representative methods from both paradigms on the D-NeRF benchmark. Our findings reveal that structure-guided methods achieve superior reconstruction fidelity and compact model sizes, while gaussian-centric approaches demonstrate significantly higher rendering speeds enabling real-time performance, though with greater quality variability and potentially substantial storage overhead. This analysis highlights a fundamental trade-off between reconstruction quality/compactness versus rendering speed, providing insights to guide future research and application development in dynamic scene reconstruction.
[CV-267] Optimizing 3D Gaussian Splatting via Point Cloud Upsampling
链接: https://arxiv.org/abs/2606.00450
作者: Adrian Ramlal,Yan Song Hu,John S. Zelek
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Accepted in Journal of Computational Vision and Imaging Systems (JCVIS)
Abstract:3D Gaussian Splatting (3DGS) is a technique for creating and rendering 3D scenes, however its performance depends heavily on the quality of initial seed points. To improve 3DGS initialization, this study presents and evaluates several point cloud upsampling approaches: linear interpolation, triangular interpolation, spline-based surface reconstruction, moving least squares surface fitting, and Voronoi-based point generation. Additionally, this research introduces a depth-guided point lifting method that leverages depth maps to maintain geometric consistency with Structure-from-Motion (SfM) reconstructions. Through extensive experiments on the Mip-NeRF360 and Replica datasets, the proposed methods demonstrate improvements in reconstruction quality across diverse scene types. Results indicate that different upsampling strategies excel in different scenarios: surface reconstruction methods perform better with organic, detailed scenes, while simpler interpolation approaches are more suited for scenes dominated by piecewise-smooth geometries. In comparison, the depth-guided approach shows promise for adding geometry-aware points across the entire scene, importantly in texture-less regions. These findings, which provide preliminary practical guidelines for selecting appropriate upsampling methods based on scene characteristics and computational constraints, advances the understanding of how point cloud initialization affects 3DGS quality.
[CV-268] GeoSAM-3D: Geodesic Prompt Propagation for Open-Vocabulary 3D Scene Segmentation from Monocular Video
链接: https://arxiv.org/abs/2606.00447
作者: Arun Sharma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Open-vocabulary 3D scene segmentation usually assumes RGB-D video, calibrated multi-view imagery, or a reconstructed mesh. GeoSAM-3D studies a lighter setting: a user uploads a short monocular video, clicks or names an object in one frame, and receives a propagated 3D mask over a Gaussian scene. The implementation combines frozen image and video foundation models with a monocular 3D Gaussian Splatting reconstruction and a differentiable graph-geodesic propagation kernel over Gaussian centroids. The central design choice is to propagate prompts by heat-kernel distance on the reconstructed scene graph, rather than by Euclidean nearest neighbors in 3D. This preserves continuity around curved surfaces and reduces leakage across nearby but disconnected objects. This paper describes the repository state, the mathematical kernel implemented in this http URL, the feature head trained from Segment Anything masks, and the validation already present in the codebase. The evaluation protocol separates implementation validation, graph propagation quality, leakage control, and interactive latency.
[CV-269] DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection
链接: https://arxiv.org/abs/2606.00445
作者: Arun Sharma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.
[CV-270] Real-Time Physics Simulation with Dynamic Mesh-Gaussian Reconstructions
链接: https://arxiv.org/abs/2606.00444
作者: Adrian Ramlal,John S. Zelek
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Integrating dynamic 3D reconstructions into physics simulation requires fixed mesh topology for efficient collision detection, but state-of-the-art methods like DG-Mesh produce varying topology optimized for geometric quality. We investigate whether topology conversion can enable physics integration while preserving reconstruction fidelity. We propose a dual-representation framework combining fixed-topology meshes for physics with Gaussian splatting for rendering, achieving 4.65 \times speedup over varying-topology baselines through runtime vertex buffer updates. We evaluate two conversion strategies, temporal correspondence tracking and template-based projection, against native fixed-topology methods (MaGS) on the DG-Mesh dataset. Our evaluation reveals that both conversion approaches incur 65-80% geometric degradation, producing results inferior to MaGS despite DG-Mesh’s superior initial quality. This demonstrates that high-quality reconstruction and physics-compatible topology represent fundamentally distinct objectives that cannot be reconciled through post-processing. Our findings inform future development of physics-aware reconstruction methods and our framework enables real-time simulation with any fixed-topology approach.
[CV-271] Physical Object Understanding with a Physically Controllable World Model CVPR2026
链接: https://arxiv.org/abs/2606.00439
作者: Rahul Venkatesh,Klemen Kotar,Lilian Naing Chen,Wanhee Lee,Gia Ancone,Seungwoo Kim,Luca Thomas Wheeler,Jared Watrous,Honglin Chen,Daniel Bear,Stefan Stojanov,Daniel LK Yamins
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CVPR 2026 Highlight. Project page at: this https URL
Abstract:A central challenge in visual intelligence is learning the physical structure of scenes from raw videos: how regions form objects and the laws that govern their interactions. Solving these tasks requires world models capable of inferring distributional states of the world from partial observations - capabilities that current architectures do not provide. We introduce a new class of probabilistic world models that support estimation of the probability of any visual variable, such as appearance and dynamics, conditioned on any other variables. Here, we identify that these models can be trained efficiently with autoregressive sequence modeling, yielding world models from which rich object understanding emerges. First, we demonstrate that our model captures the physical laws governing how objects move by generating multiple plausible future states of the world through sequential inference. Then, by analyzing motion correlations across these futures, we extract objects and articulated object subparts. Having discovered these objects, we show that our world model can manipulate them in 3D. Finally, we demonstrate how physical relationships between objects can be computed from the world model, enabling applications such as Visual Jenga.
[CV-272] Detect Before You Leap: Mirag e Detection in Vision-Language Models
链接: https://arxiv.org/abs/2606.00435
作者: Sayeed Shafayet Chowdhury,Md. Shaown Miah
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, known as mirage (Asadi et al. 2026), is especially concerning in medical and document visual question answering, where plausible but visually ungrounded responses may be mistaken for image-based evidence. We study pre-release mirage detection: given an image-question pair, the goal is to determine whether a VLM should answer or abstain before producing a response. We propose Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. TC-LIA projects layer-wise image patch tokens into the final CLIP embedding space and measures their similarity to the question embedding, allowing the method to track whether question-relevant visual evidence emerges across vision layers. The resulting alignment trajectory is summarized using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains, three input conditions, and twelve VLM backbones, the best systems achieve approximately 94.6-94.7% three-class detection accuracy with mirage rates below 3%, while baseline mirage rates range from 21.7% to 66.6%.
[CV-273] 4D Radar Meets LiDAR and Camera: Cooperative Perception under Adverse Weather CVPR
链接: https://arxiv.org/abs/2606.00416
作者: Melih Yazgan,Iramm Hamdard,Qiyuan Wu,J.Marius Zoellner
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by CVPR - DriveX Workshop
Abstract:Cooperative perception is important for autonomous driving but remains fragile when cameras and LiDAR degrade in adverse weather. We address this challenge by integrating 4D imaging radar as a weather-robust modality into collaborative perception and introducing a Doppler-guided spatial attention mechanism for multi-agent fusion. Our approach extends two representative backbones: a radar-camera pipeline where radar substitutes LiDAR, and a LiDAR-radar pipeline where radar complements LiDAR. To support evaluation, we release radar-augmented benchmarks, OPV2V-R and Adver-City-R, with physics-based LiDAR degradation. Experiments show strong robustness gains in fog and rain, including substantial improvements when radar replaces degraded LiDAR. Additional validation on MAN TruckScenes demonstrates transfer beyond simulation. Overall, our results highlight 4D imaging radar as a robust modality for all-weather collaborative perception. Dataset and code are available at: this https URL.
[CV-274] Rethinking Amortized Neural Representations for High-Resolution Terrain Elevation Data
链接: https://arxiv.org/abs/2606.00404
作者: Haoan Feng,Xin Xu,Leila De Floriani
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 7 figures, 10 tables
Abstract:Implicit neural representations (INRs) model a signal as a continuous coordinate-to-value function. For terrain elevation data, this supports analytic derivatives, arbitrary-resolution decoding, and a smooth surface model of the underlying heightfield. However, fitting and storing a separate INR for every tile does not scale to large terrain datasets. Amortized neural representations reduce this cost with a shared network: a new tile is mapped to a compact per-tile payload, and a shared decoder reconstructs the heightfield from it. Most such methods are hypernetworks that predict the payload in a single forward pass, while others recover it through a short per-tile optimization. These methods were developed primarily for natural images, and their suitability for terrain heightfields remains unclear. We introduce a controlled benchmark on a 1 m/pixel terrain dataset and evaluate three representative methods under a unified protocol. Observing a clear cross-domain gap, we propose HUVR+SIREN, a hypernetwork that adapts the strongest benchmarked method (HUVR) by replacing its coordinate decoder with a smooth, analytically differentiable one. It attains the best height and derivative fidelity on the benchmark with no additional per-tile storage and lower decode cost, and tolerates aggressive post-training quantization with negligible quality loss, giving a compact terrain neural format. Ablations and diagnostics further identify which design choices transfer to terrain and show that the per-tile bottleneck is already near its useful limit, leaving the remaining gap in the shared hypernetwork’s architectural design.
[CV-275] Zamba2-VL Technical Report
链接: https://arxiv.org/abs/2606.00390
作者: Hassan Shapourian,Kasra Hejazi,Olabode M. Sule,Beren Millidge
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures
Abstract:We present Zamba2-VL, a suite of vision-language models built on Zamba2, a hybrid language-model architecture combining Mamba2 state-space layers with a small number of shared transformer blocks. Across a broad range of image understanding, reasoning, OCR, grounding, and counting benchmarks, Zamba2-VL is competitive with leading Transformer-based open-weight VLMs of comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, and substantially outperforms prior SSM-based and hybrid VLMs such as VL-Mamba, Cobra, and mmMamba. Inheriting the near-linear prefill compute and small, near-constant recurrent state of its Zamba2 backbone, Zamba2-VL delivers roughly an order of magnitude lower time-to-first-token (TTFT) than these Transformer baselines at matched parameter scale, with the efficiency gap most pronounced at the smaller 1.2B and 2.7B scales most relevant to on-device and edge deployment. We release three models – 1.2B, 2.7B, and 7B – together with inference code at this https URL.
[CV-276] αDepth: Learning Single-Pass Soft Boundary Decomposition for Stereo Conversion
链接: https://arxiv.org/abs/2606.00386
作者: Xiang Zhang,Yang Zhang,Lukas Mehl,Karlis Martins Briedis,Markus Gross,Christopher Schroers
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately modeling soft boundaries, e.g., hair and defocus blur, is a fundamental challenge in stereo conversion due to the ambiguous blending of foreground and background. Existing depth models primarily predict single-layer depth, leading to ambiguity in depth correspondence at soft boundaries. While matting techniques can capture opacity for layered modeling, they often struggle in complex scenes with multiple targets and usually require user intervention. This paper introduces \alphaDepth, a layered representation that decomposes soft boundaries for high-fidelity stereo conversion. Specifically, we first resolve mixed color and depth ambiguity by estimating layered color and depth values at soft boundaries. Considering complex multi-target scenes, we design a Circular Alpha Representation (CAR) that shifts the paradigm from global target extraction to local boundary decomposition. Unlike prior matting methods restricted to a single foreground/background, CAR enables efficient scene-level inference without manual guidance. Extensive evaluations demonstrate that \alphaDepth achieves state-of-the-art performance in stereo conversion, eliminating background bleeding and structural distortions at soft boundaries.
[CV-277] VESTA: Visual Exploration with Statistical Tool Agents
链接: https://arxiv.org/abs/2606.00384
作者: William Rudman,Abhishek Divekar,Kanishk Jain,Sebastian Joseph,Stella S. R. Offner,Matthew Lease,Kyle Mahowald,Greg Durrett,Junyi Jessy Li
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computation (stat.CO)
备注:
Abstract:Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model’s context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA’s dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.
[CV-278] SUPREME: A Multi-GPU Framework for Reproducible Image Unlearning Method Evaluation
链接: https://arxiv.org/abs/2606.00380
作者: Petros Andreou,Jamie Lanyon,Axel Finke,Georgina Cosma
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages. Code available at this https URL
Abstract:Machine unlearning removes the influence of specific training data from a trained model without retraining it from scratch. Evaluating an unlearning method requires repeating training, unlearning, and evaluation across multiple seeds, which is computationally expensive. To our knowledge, existing image classification unlearning frameworks run on a single GPU, which limits how many seeds can be evaluated in reasonable time. We introduce SUPREME, an open-source framework that distributes these stages across multiple GPUs. SUPREME makes three contributions: a registry-based design for adding new methods, metrics, models, and scenarios; a multi-GPU architecture supporting multiple accelerators and precision modes; and a demonstration on Pins Face Recognition using ResNet18 and ViT under full-class and random-sample unlearning across ten seeds. The framework is available at this https URL.
[CV-279] Non-Learning Low-Light Stereo Vision ICIP2026
链接: https://arxiv.org/abs/2606.00379
作者: Jason Wang,Lucas Nguyen,Hyunseung Eom,Wei Xu,Qi Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIP 2026. Code and data available at this https URL
Abstract:We present a non-learning stereo framework for disparity estimation from severely noisy images. Using the Field of Junctions (FoJ), it retains coarse visual features stable under severe noise for cost volume construction while discarding fine textures inseparable from photon noise. The resulting structural information guides boundary-aware Semi-Global Matching (SGM) that dynamically adapts smoothness penalties to preserve true disparity discontinuities. The output is a sparse disparity map more accurate than those of recent stereo algorithms over unmasked pixels on widely-used benchmark datasets.
[CV-280] Score-Control for Hallucination Reduction in Diffusion Models
链接: https://arxiv.org/abs/2606.00377
作者: Mahesh Bhosale,Naresh Kumar Devulapally,Abdul Wasi,Chau Pham,Vishnu Suresh Lokhande,David Doermann
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have emerged as the backbone of modern generative AI, powering advances in vision, language, audio and other modalities. Despite their success, they suffer from hallucinations, implausible samples that lie outside the support of true data distribution, which degrade reliability and trust. In this work, we first empirically confirm previously proposed hypothesis that score smoothness causes hallucinations in Image Generation diffusion models and provide a density-based perspective. We further formalize this notion by linking the hallucinations probability mass to lipschitz constant of the learned score function. Motivated by this, we introduce a Variance-Guided Score Modulation (VSM) strategy that controls the score Jacobian, in turn reducing score smoothness and better approximating the ground truth score that decreases hallucinations. Empirical results on synthetic and real-world datasets demonstrate that our approach reduces hallucinations (up to ~25%) while maintaining high fidelity and diversity, providing a principled step toward more reliable diffusion-based image generation. We also propose two benchmark datasets with extreme semantic variation for systematic hallucination evaluation. Code and Datasets are publicly available at this https URL.
[CV-281] LFA: Layer Feature Attention for Run-Time Introspection of 2D Object Detectors in Automated Driving
链接: https://arxiv.org/abs/2606.00372
作者: Mert Keser,Alois Knoll
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reliable object detection is critical for automated driving, yet even state-of-the-art detectors inevitably make errors that can compromise safety. Introspection methods that predict detector failures enable safer deployment by triggering fallback mechanisms or alerting human operators. However, existing approaches rely solely on last-layer features or hand-crafted statistics, discarding valuable information from earlier layers that capture different levels of visual abstraction. We propose Layer Feature Attention (LFA), a lightweight introspection method that learns to aggregate features from multiple backbone layers through an attention mechanism. Our key insight is that detection errors manifest differently across feature hierarchies: low-level layers capture fine-grained details essential for detecting small or occluded objects, while high-level layers encode semantic information for scene understanding. LFA learns layer importance weights end-to-end, enabling both improved error prediction and interpretable analysis of which feature levels are most indicative of detector failures. Extensive experiments on KITTI and BDD100K demonstrate that LFA achieves state-of-the-art introspection performance, outperforming single-layer baselines across multiple detector architectures.
[CV-282] HiGS: A Hierarchical Rendering Architecture for Real-Time 3D Gaussian Splatting
链接: https://arxiv.org/abs/2606.00352
作者: Dawid Pająk,Martin Bisson,Rodolfo Lima
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL
Abstract:3D Gaussian Splatting (3DGS) has become the standard for real-time novel view synthesis on commodity GPUs. Its pipeline ties spatial partitioning and rasterization to one tile size, yet the two pull in opposite directions: partitioning, which bins and depth-sorts gaussians, grows cheaper with larger tiles, while rasterization gets cheaper with smaller ones. Prior acceleration work reduces the cost of individual stages but keeps both locked to that single scale, where a few dense tiles dominate frame time. We present Hierarchically Tiled Gaussian Splatting (HiGS), which gives each its own scale: partitioning runs over coarse macro-tiles, while rasterization runs over the fine render tiles within them. Rasterization work is then issued in proportion to the gaussians in each macro-tile rather than per tile, so dense regions spread across many parallel units instead of serializing through one. Across tested scenes, HiGS renders up to 15.8x faster than the original 3DGS and outperforms every other rasterizer we evaluate, while preserving exact front-to-back alpha compositing.
[CV-283] UniVerse: A Unified Modulation Framework for Segmentation-FreeDisentangled Multi-Concept Personalization
链接: https://arxiv.org/abs/2606.00351
作者: Quynh Phung,Sandesh Ghimire,Minsi Hu,Chung-Chi Tsai,Jia-Bin Huang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Personalized visual understanding has advanced significantly, yet existing approaches struggle to localize and extract specific concepts when input images contain multiple objects. Many prior methods rely heavily on segmentation-based supervision or exhibit poor compositional generalization, limiting their ability to accurately disentangle and manipulate individual concepts. In this work, we propose UniVerse, a Unified Modulation Framework for segmentation-free, disentangled multi-concept personalization in diffusion transformers. Our method allows for composable and decomposable concept extraction, enabling fine-grained localization and representation of target objects without explicit segmentation masks. UniVerse learns to decompose complex scenes into concept-specific representations and then compose them in a unified manner, enabling robust personalization across diverse visual contexts. Through extensive experiments on multiple benchmarks, we demonstrate that UniVerse significantly outperforms state-of-the-art baselines in both localization accuracy and visual fidelity. Qualitative and quantitative results show that our approach can precisely extract target concepts in cluttered scenes, paving the way for more flexible, interpretable, and personalized visual generation and understanding.
[CV-284] raining-Free Object-Agnostic Jam Detection in Fulfillm ent Centers
链接: https://arxiv.org/abs/2606.00321
作者: Ruiliang Liu,Tina Dongxu Li,Joshua Migdal,Fernando Ruch,Kenneth Meszaros,Moses Trevor Dardik
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 6 figures. Accepted at the 2026 IEEE International Conference on Automation Science and Engineering (CASE 2026) as a presentation-only paper
Abstract:In fulfillment centers, diverse objects move continuously from inbound to outbound operations and can become jammed due to excessive conveyor friction, incorrect orientation, or mechanical failures. Traditional jam detection approaches rely on object detection models to identify objects, followed by tracking algorithms (such as IoU overlap and Kalman filtering) to monitor motion over time. This pipeline requires thousands of manual annotations, consuming approximately two weeks of effort, and is limited to annotated object classes. We present a training-free, object-agnostic jam detection method that eliminates the need for labeled data. Our approach uniformly samples reference points within the monitoring region when no objects are present. As objects occlude these points, we detect motion. When a sufficient fraction remains occluded beyond a temporal threshold, we classify the event as a jam. Unlike conventional point tracking–which treats occlusion as a failure case–our approach repurposes occlusion as a detection signal, monitoring whether reference points remain persistently occluded rather than tracking where they move. Our experimental evaluation on 1,069 videos demonstrates that AllTracker achieves 100.00% precision and 93.33% F1 score, significantly outperforming classical sparse tracking methods while maintaining training-free deployment. This approach offers three key advantages: (1) no training data or manual annotations, (2) object-agnostic generalization to arbitrary object types, and (3) significantly reduced development time.
[CV-285] Belief Consistency Between Foundation-Model Evidence and Geometric Perception in Persistent Robotic Maps
链接: https://arxiv.org/abs/2606.00318
作者: Christoffer Heckman,Harel Biggie,Brendan Crowe,Nicholas Roy
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Persistent maps used by autonomous robots increasingly fuse a geometric perception stack whose assertions are well-characterized with a foundation-model channel that produces semantic claims without calibrated reliability about the same scene. Contemporary mapping systems integrate the two channels by treating the foundation-model channel as an additional voter into a per-element posterior, uncalibrated for its own per-class reliability and without machinery to flag when the two channels contradict each other at a given moment. We propose an update operator with two cooperating mechanisms: a per-class calibrated commit gate, and a per-event conflict-drop window that refuses to commit foundation-model claims contradicted by the geometric channel at the moment of the claim. We evaluate on KITTI-360 and ScanNet, with an oracle geometric channel (panoptic ground truth) and an off-the-shelf online semantic segmenter (Mask2Former) to demonstrate real-world performance. The operator produces substantially more accurate committed maps (KITTI is car commit precision 99.7% vs. 43.9% for the calibration-only operator; mean per-class IoU 0.522 vs. 0.180), retains more compositional true positives at higher precision than a monolithic compositional VLM prompt. The framework operates at deployment quality across both oracle and off-the-shelf-segmenter geometric channels, and is invariant under foundation-model substitution.
[CV-286] Where to Refine When to Stop: Rethinking Redundancy via Latent Discrepancy for Efficient Visual Autoregressive Generation
链接: https://arxiv.org/abs/2606.00310
作者: Changwang Mei,Peisong Wang,Zekun Li,Changsheng Li,Shuang Qiu,Qinghao Hu,Gang Li,Yifan Zhang,Zhihui Wei,Jian Cheng
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual Autoregressive (VAR) models deliver high-quality image generation but suffer from significant inference latency at high resolutions. Recent acceleration approaches most rely on heuristic measures with layer features to prune tokens. Such heuristics are sensitive to complex contextual semantics, leading to inaccurate identification of redundant computation and poor adaptability across prompts. We rethink redundancy in VAR from the perspective of its impact on pixel-space generation and introduce Latent Discrepancy. This unified metric quantifies a token’s contribution by measuring the change in model states during generation. Our analysis shows that redundancy is more accurately identified when guided by image latent or pixel-space signals. We further observed that in classifier-free guidance (CFG), the convergence trend of the discrepancy between conditional and unconditional branches exhibits high dynamics with different prompts. Based on these findings, we propose LD-Pruning (Latent Discrepancy Pruning), a training-free framework that removes redundancy via latent discrepancy by integrating decoding-free region selection and adaptive unconditional-branch skipping. Extensive experiments show that LD-Pruning substantially reduces inference latency while maintaining high generation quality, achieving up to 2.35x speedup on Infinity-8B.
[CV-287] Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion
链接: https://arxiv.org/abs/2606.00299
作者: Jiayi Wu,Haoming Cai,Cornelia Fermuller,Christopher Metzler,Yiannis Aloimonos
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:While Video Diffusion Models (VDMs) excel at synthesizing high-fidelity videos, enabling precise camera and scene control remains challenging. Existing methods predominantly rely on implicit diffusion priors to generate unobserved regions, inevitably leading to structural collapse during high-dynamic movements or complex occlusions. To address this challenge, we propose Real2SAM2Real, a framework that leverages 3D lifting models (e.g., SAM3D) to extract an explicitly editable 3D cache, serving as a robust geometric scaffold for the VDM. By capturing the entire 3D volume of foreground entities rather than just their visible shells, this cache injects holistic spatial priors into the VDM, providing dependable 3D-aware guidance for complex scene dynamics. To effectively leverage this 3D guidance while preserving pre-trained priors, we design a Soft Spatial-Aligned Injection mechanism alongside a minimally invasive fine-tuning strategy tailored for VDMs. Furthermore, we employ masked normal maps as a cross-modal bridge to construct a 3D-free data curation and perturbation pipeline. Extensive experiments demonstrate that Real2SAM2Real enables precise, decoupled control over both camera trajectories and multi-entity motions. By utilizing the complementary context from generative 3D caches, our framework overcomes typical breakdowns caused by over-reliance on diffusion priors, maintaining exceptional spatiotemporal consistency under large camera shifts and severe occlusions. Crucially, by decoupling geometry from appearance, our VDM-tailored 3D cache eradicates perspective ambiguities caused by structural holes and erroneous facades, as well as misleading cues from reflections and refractions. Project website is available at this https URL
[CV-288] Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
链接: https://arxiv.org/abs/2606.00275
作者: Zijie Zhou,Dandan Zhu,Hangxiangpan Wang,Heng Zhang,Huishen Jiao,Yi Zhao
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
[CV-289] StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement
链接: https://arxiv.org/abs/2606.00267
作者: Junwon Seo,Sushant Veer,Ran Tian,Wenhao Ding,Apoorva Sharma,Karen Leung,Edward Schmerling,Marco Pavone,Andrea Bajcsy
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
备注: Project page: this https URL
Abstract:Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at this https URL.
[CV-290] he Harsh Truth: Segment-Level Analysis of Harsh Driving Events in Milan Using Large-Scale Telematics Street Networks and Google Street View
链接: https://arxiv.org/abs/2606.00261
作者: Andrea La Grotteria,Paolo Santi,Titus Venverloo,Umberto Fugiglando,Carlo Ratti
类目: Computer Vision and Pattern Recognition (cs.CV); Physics and Society (physics.soc-ph)
备注:
Abstract:Police-reported crash statistics remain the standard input for urban road-safety assessment, but their incompleteness and reporting lag limit their usefulness for timely, fine-grained intervention design. Harsh acceleration and braking events are widely used as surrogate safety indicators, but have so far been studied only in comparatively small urban samples. This study analyses harsh events across the urban road network of Milan, combining high-resolution telematics from more than 4.2 million vehicles equipped with On-Board Units, segment-level traffic metrics from TomTom, street-network and infrastructure attributes from OpenStreetMap, and visual streetscape features extracted from Google Street View via semantic segmentation using a OneFormer model. We employ an analytical framework combining non-parametric Mann–Whitney U tests of segment-feature distributions between high- and low-harshness groups with supervised machine-learning regressors. We find that, once exposure is controlled for, wider carriageways, crossings and transit stops, and more open visual fields (higher sky- and road-pixel proportions) are associated with higher harsh-event intensity, while denser built frontage is associated with lower intensity. Finally, the cycling-infrastructure case study identifies a gradient in harsh-event intensity across facility types: markings-only cycle lanes are associated with a 19.5% higher harshness score, and mixed-traffic configurations with an 11.5% higher score, relative to physically separated cycle paths, conditional on the included controls. These results support context-specific rather than uniform urban-safety interventions and illustrate how large-scale telematics combined with open geospatial and visual data can inform Vision Zero decision-making at the metropolitan scale.
[CV-291] LastAct: Trajectory-Guided Latest-Activity Localization for Real-Time Smart-Home Activity Recognition
链接: https://arxiv.org/abs/2606.00260
作者: Zishuai Liu,Ruili Fang,Jin Lu,Fei Dou
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Human Activity Recognition (HAR) from ambient sensors enables smart-home applications such as health monitoring and assisted living. In realistic deployments, however, sensor events arrive as a continuous stream and activity boundaries are unknown. Sliding-window inference therefore produces many windows that straddle transitions and contain mixed activities, creating boundary contamination that violates the pre-segmented instance assumption used by most benchmarks and models. Moreover, many pipelines under-use spatial context by treating sensor IDs as independent tokens. We present LastAct, a trajectory-centric framework for streaming smart-home HAR that targets the most recent activity under mixed windows while explicitly modeling spatial structure. LastAct projects sensor events onto the home floorplan to form a layout-aligned trajectory image sequence that preserves spatial continuity. A lightweight gate identifies contaminated windows, and a boundary localizer estimates the last transition to enable boundary-guided masking that emphasizes post-boundary evidence and suppresses stale context. For efficiency, we reuse a precomputed layout-aligned template cache to avoid repeated rendering. Empirically, across four public smart-home datasets under near-realistic mixed-activity protocols, LastAct achieves competitive or superior performance on pure windows and yields substantial Macro-F1 gains on cross/mixed windows, demonstrating improved robustness under near-realistic sliding-window regimes.
[CV-292] APE: Agent ic Prompt Enhancer for Image Generation and Editing
链接: https://arxiv.org/abs/2606.00204
作者: Zijian Huang,Jay Zhangjie Wu,Zian Wang,Tianshi Cao,Jiasi Chen,Sanja Fidler,Huan Ling,Xuanchi Ren
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Natural language has become a powerful interface for image generation and editing, yet text-guided visual systems remain highly sensitive to prompt formulation. Semantically similar requests can produce different outputs depending on wording, specificity, and how explicitly visual constraints are stated, motivating prompt enhancement as a trainable component rather than a peripheral user choice. Existing strong enhancers often rely on large, proprietary LLMs such as ChatGPT or Gemini, adding cost, latency, and deployment dependence to the visual generation pipeline. We propose Agentic Prompt Enhancer (APE), a lightweight framework that post-trains small language models (SLMs) as prompt-enhancement agents. APE supports both single-agent rewriting and role-specialized multi-agent enhancement. Its single-agent instantiation, SAPE, rewrites the prompt in one pass, while its multi-agent instantiation, MAPE, decomposes enhancement into a router–rewriter–composer process for handling compositional constraints over objects, attributes, spatial relations, and edits. With task-aware rewards and post-training protocols, APE improves visual alignment and prompt following without modifying the downstream visual model. Experiments on challenging image generation and editing benchmarks demonstrate that post-trained small prompt enhancers reliably outperform their base counterparts, narrowing the gap to closed-source prompt enhancers; in addition, MAPE proves particularly strong on complex compositional tasks within these benchmarks.
[CV-293] Safe2Drive: Evaluating Safe Driving Behaviors of E2E Autonomous Driving Models
链接: https://arxiv.org/abs/2606.00191
作者: Nishad Sahu,Kalpana Panda,Congyuan Yu,Changzhong Qian,Shounak Sural,Ragunathan Rajkumar
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent end-to-end (E2E) autonomous driving policies achieve high driving scores in closed-loop simulations. Yet it remains unclear whether these policies handle common safety-critical scenarios. We present Safe2Drive (S2D), a set of Bench2Drive-aligned scenario extensions focused on three frequent families of road hazards: work zones, pedestrian jaywalking, and occluded vulnerable road users (VRUs). Safe2Drive adds 100 common but challenging scenarios and introduces SafeDriving Score (SDS), a safety-centric metric that augments prior evaluators with pre-crash braking, work zone-object contact, lane centering, and smoothness checks. Evaluating two state-of-the-art policies (LEAD and SimLingo) on S2D, we find that their driving scores drop sharply relative to their reported Bench2Drive baselines (LEAD: from 94.70 DS on Bench2Drive to 39.95 DS on S2D; SimLingo: from 85.07 DS on Bench2Drive to 41.00 DS on S2D) and that SDS on S2D is low (11.85 for LEAD and 15.27 for Sim-Lingo). These results are consistent with brittle safe-driving behaviors such as poor work-zone understanding, red-light violations, and late or absent braking for pedestrians. This study highlights a lack of safe behavioral reasoning in E2E models even when tested on CARLA towns that are part of the training set. We plan to release the code and videos for all 100 S2D scenarios.
[CV-294] MyoSem: Aligning Electromyography to Natural-Language Action Semantics for Hand Action Understanding
链接: https://arxiv.org/abs/2606.00174
作者: Chiyue Wang,Dong She,Yang Gao,Zhanpeng Jin
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures. Preprint
Abstract:Electromyography (EMG) directly reflects muscle activation and is a key sensing modality for gesture recognition, prosthetic control, and wearable interaction. Existing EMG methods, however, commonly formulate hand action understanding as classification over fixed labels, making it difficult to support querying, retrieval, and generalization based on action descriptions. We present MyoSem, an EMG–action semantic alignment framework that maps low-level EMG signals into a shared semantic space constructed from multi-view action descriptions. MyoSem combines multi-view action-semantic construction, activation-aware EMG encoding, and semantic query alignment, enabling bidirectional retrieval between EMG signals and text descriptions. We systematically evaluate MyoSem on EMG2Pose and NinaPro-series datasets. Results show that MyoSem performs well on EMG–text bidirectional retrieval, generally outperforms most baselines, and shows favorable generalization to unseen users, held-out action classes, and amputee-user transfer scenarios. Ablations and visualizations further validate the effectiveness of each module. Overall, MyoSem advances EMG-based hand action understanding from fixed-label recognition toward queryable bidirectional semantic retrieval, providing a new modeling paradigm for language-mediated EMG action understanding.
[CV-295] Modeling Robotics Dataset Construction as an Artifact-Based Build Process
链接: https://arxiv.org/abs/2606.00162
作者: Leon Pohl,Lukas Beer,George Sebastian,Mirko Maehlisch
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), 6 pages, 6 figures, 2 tables
Abstract:Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact-based build process over a dependency graph and implement this approach in Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation (including nuScenes-format export). We compare Bagzel and Bagzel-xattr (server-side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel-xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact-based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at this https URL.
[CV-296] Digital-to-Physical Transfer of Adversarial Patches for Aerial Vehicle Detection
链接: https://arxiv.org/abs/2606.00159
作者: Jung Heum Woo,Eun-Kyu Lee
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 18 pages, 5 figures, 3 tables, preprint
Abstract:Deep neural network (DNN)-based object detectors are widely used for analyzing aerial and satellite imagery in applications such as environmental monitoring and urban analytics. Despite their strong performance, these models are known to be vulnerable to adversarial examples, and physical adversarial attacks using printable patterns pose realistic security threats. In this paper, we evaluate physical adversarial patch attacks against an aerial vehicle detector by bridging digital optimization and real-world deployment. Adversarial patches are optimized in the digital domain using a loss function that minimizes the maximum objectness score while incorporating non-printability score (NPS) and total variation (TV) constraints to ensure both printability and spatial smoothness. The optimized patches are printed and deployed in three configurations: ON, OFF, and OFF-Side. Experiments using a YOLOv3 detector show that while the OFF patch achieves the highest effectiveness in the digital domain (85.51% Average Objectness Reduction Rate (AORR)), the ON patch demonstrates superior robustness in physical environments (0.197-0.343 Objectness Score Ratio (OSR)) due to its consistent visibility. Furthermore, our results indicate that weather-based augmentation does not necessarily improve patch optimization in this domain. These findings provide critical insights into the practical vulnerabilities of aerial object detection systems.
[CV-297] DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion ICML2026
链接: https://arxiv.org/abs/2606.00153
作者: Zhiyang Lu,Ming Cheng
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICML2026
Abstract:Cross-modal 2D-3D gait recognition is impeded by inherent domain discrepancies between 2D silhouette and 3D LiDAR range-view representations. While prior methods align only final embeddings, we propose DiffCrossGait, which reformulates cross-modal matching as trajectory-level alignment in an identity-relevant latent diffusion space, rather than assuming full equivalence between 2D and 3D observations. By driving both modalities with shared Gaussian noise within a latent space, we enable continuous alignment throughout the generative evolution. We introduce a Tri-Phase Alignment Strategy that exploits varying noise intensities to enforce identity anchoring, dynamics consistency, and cross-modal structural recoverability, thereby constraining both modalities to share denoising dynamics and bottleneck structure, which promotes modality-invariant gait features. Crucially, our framework decouples generative alignment from the discriminative backbone; the diffusion mechanism serves exclusively as a training objective, ensuring high inference efficiency by eliminating the computational overhead of iterative denoising. Extensive experiments on the SUSTech1K and FreeGait benchmarks demonstrate that DiffCrossGait achieves state-of-the-art performance.
[CV-298] StemBind: When MLLM s Get Lost Between Rules and Instances in Abstract Visual Reasoning
链接: https://arxiv.org/abs/2606.00148
作者: Xixiang He,Baiqi Wu,Xingming Li,Ao Cheng,Qiyao Sun,Xuanyu Ji,Qingyong Hu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Project page: this https URL
Abstract:Multimodal large language models (MLLMs) often know the rule but pick the wrong answer: on abstract visual reasoning (AVR) tasks, a model can describe what it sees and name the underlying pattern, yet still fail to choose the matching candidate. Existing AVR benchmarks cannot detect this because they collapse perception, rule induction, and answer selection into a single right-or-wrong signal. We introduce StemBind, a shared-stem diagnostic benchmark that probes the same visual stem with three aligned questions: Perception (what is in the image), Rule (what pattern governs it), and Full (which option completes it), so a final-answer error can be attributed to a specific sub-step on the same evidence. StemBind contains 2,298 curated knowledge-light stems across nine auditable visual operations, totaling 19,533 P/R/F tasks, with each full item annotated by Sternberg’s four reasoning stages (S1 Encode, S2 Infer, S3 Map, S4 Apply). Evaluating 24 frontier MLLM configurations yields four findings. (i) The R-F chasm: rule accuracy exceeds full-item accuracy on 22 of 24 models, so most failures happen after the rule is identified. (ii) A persistent binding gap: even when P and R are both correct on the same stem, models still answer F incorrectly 51.2% of the time. (iii) The bottleneck is S3: process diagnostics and Stage-wise Stimulus Augmentation localize the dominant failure to rule-to-instance mapping. (iv) Scaling and thinking do not help: neither larger models nor explicit thinking mode reliably closes the gap, and thinking even lowers rule and full-item accuracy. StemBind reframes AVR evaluation from final-answer ranking to locating where abstract visual reasoning breaks down, identifying rule-to-instance binding as a concrete next target for vision-grounded reasoning.
[CV-299] Geodesics with Unified Tangent-constrained Priors and Curvature Regularization
链接: https://arxiv.org/abs/2606.00139
作者: Chong Di,Li Liu,Jinglin Zhang,Zhenjiang Li,Da Chen,Laurent D. Cohen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Curvature-penalized geodesic models have proven their effectiveness in image segmentation by computing globally optimal curves. Unfortunately, these models remain susceptible to shortcuts when delineating objects with complex shapes and image intensity distributions, as they lack mechanisms to enforce shape-aware tangent constraints. To address this limitation, we propose a unified geodesic framework that integrates tangent-constrained priors with curvature penalization. The key idea is to formulate tangent admissibility directly within the orientation-lifted space, where path tangents are restricted to spatially varying angular sectors derived from intrinsic shape representatives (ISR) such as skeletons or interior landmarks. This formulation gives rise to a family of tangent-constrained Finslerian metrics, extending the classical curvature-penalized geodesic models while enforcing mandatory tangent constraints. The resulting Hamilton-Jacobi-Bellman (HJB) partial differential equations (PDEs) admit efficient numerical solutions via variants of the fast marching method, preserving the single-pass computational complexity. Experiments on synthetic, natural, and medical images demonstrate that the proposed geodesic framework indeed improves robustness against weak boundaries and topological shortcuts, yielding segmentation results with enhanced shape fidelity compared to existing geodesic models.
[CV-300] Advances in Neural 3D Mesh Texturing: A Survey
链接: https://arxiv.org/abs/2606.00137
作者: Sai Raj Kishore Perla,Hao Zhang,Ali Mahdavi-Amiri
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Eurographics STAR (Computer Graphics Forum), 2026. Project Page: this https URL
Abstract:Texturing 3D meshes plays a vital role in determining the visual realism of digital objects and scenes. Although recent generative 3D approaches based on Neural Radiance Fields and Gaussian Splatting can produce textured assets directly, polygonal meshes remain the core representation across modeling, animation, visual effects, and gaming pipelines. Neural 3D mesh texturing therefore continues to be an essential and active area of research. In this survey, we present a comprehensive review of recent advances in neural 3D mesh texturing, covering methods for texture synthesis, transfer, and completion. We first summarize key foundations in mesh geometry, texture mapping, differentiable rendering, and neural generative models, and then organize the literature into a unified taxonomy spanning early GAN-based methods to modern diffusion-based pipelines. We further analyze common architectures and supervision strategies, review datasets and evaluation protocols, and discuss emerging applications, practical/commercial systems, and open challenges. Together, these insights provide a structured perspective on the current landscape and help guide future developments in learning-based 3D mesh texturing.
[CV-301] Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness ICML2026 NEURIPS2026
链接: https://arxiv.org/abs/2606.00124
作者: Mahmoud Mannes
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages (9 main text, 7 appendix). 5 figures (3 main text, 2 appendix) with 8 graphics total. 5 tables (1 main text, 4 appendix). Submitted to NeurIPS 2026 main conference and the ICML 2026 mechanistic interpretability workshop
Abstract:Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization. Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts. We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness appears to depend on the presence of a stable positional reference frame more than it depends on the specific encoding mechanism. These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.
[CV-302] CardioLens: Revealing the Clinical Reality Gap of MLLM s via Multi-Sequence Cardiac MRI Evaluations
链接: https://arxiv.org/abs/2606.00123
作者: Zixian Su,Hongkai Zhang,Fan Gao,Encheng Su,Taiping Qu,Jingwei Guo,Nan Zhang,Hui Wang,Zhen Zhou,Kairui Bo,Yan Chen,Yue Ren,Shuai Li,Lei Xu,Henggui Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.
[CV-303] Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity
链接: https://arxiv.org/abs/2606.00121
作者: Yizhuo Lu,Changde Du,Qiongyi Zhou,Liuyun Jiang,Huiguang He
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Reconstructing visual stimuli from brain recordings has been a meaningful and challenging task in brain decoding. Especially, the achievement of precise and controllable image reconstruction bears great significance in propelling the progress and utilization of brain-computer interfaces. Recent methods, leveraging advances in the power of text-to-image generation models, have reconstructed images that closely approximate complex natural stimuli in terms of semantics (e.g., concepts and objects). However, they struggle to maintain consistency with the original stimuli in fine-grained structural information (e.g., position, orientation and size), which undermines both the controllability and interpretability of the models. To address the aforementioned issues, we propose a two-stage image reconstruction framework, termed MindDiffuser. In Stage 1, Contrastive Language-Image Pretraining (CLIP) text embeddings decoded from brain responses are input into Stable Diffusion, generating a preliminary image containing semantic information. In Stage 2, we use decoded shallow CLIP visual features as supervisory signals, iteratively refining the feature vectors from Stage 1 via backpropagation to align structural information. We conducted extensive experiments on brain response datasets across three modalities (fMRI, EEG, MEG) elicited by visual stimuli, demonstrating that our framework significantly enhances the performance of previous state-of-the-art models, highlighting the effectiveness and versatility of our approach. Spatial and temporal visualization results further support the neurobiological plausibility of our framework, providing guidance for future neural decoding efforts across different brain signal modalities.
[CV-304] Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions ICML2026
链接: https://arxiv.org/abs/2606.00115
作者: Yuanyuan Wang,Wenjie Wang,Kun Zhang,Mingming Gong
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted at ICML 2026
Abstract:Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at this https URL.
[CV-305] Recursive Vision Transformer with Dynamic Depth and Width Adjustment for Resource-Efficient Image Semantic Communication
链接: https://arxiv.org/abs/2606.00114
作者: Zhilong Zhang,Xinhui Zhang,Gongyu Jin,Sihua Wang,Danpu Liu,Changchuan Yin
类目: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
备注:
Abstract:Image semantic communication is a critical component in next-generation wireless communication systems. However, such systems typically suffer from large memory footprints and high computational complexity, making them difficult to deploy on resource-constrained devices. To address these challenges, we propose a vision transformer (ViT)-enabled image semantic communication system. In this system, a recursive structure is introduced to iteratively refine semantic features and reduce the parameter count. In addition, three dynamic adjustment strategies are designed to adaptively reduce computational complexity: dynamic depth adjustment, dynamic width adjustment, and joint width-depth optimization. Dynamic depth adjustment adaptively determines the number of recursive modules according to image content and channel conditions, while dynamic width adjustment selectively preserves important neurons and attention heads. The joint width-depth optimization further enables flexible computation configurations. Simulation results verify that the proposed recursive ViT-based system, combined with the three dynamic adjustment strategies, reduces the parameter count by 48.7% and achieves higher reconstruction quality than existing baselines under comparable computational complexity.
[CV-306] Evolving to the Aesthetics of a Vision-Language Model
链接: https://arxiv.org/abs/2606.00112
作者: Stephen James Krol,Jon McCormack
类目: Neural and Evolutionary Computing (cs.NE); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper presented at ICCC26, June 29 - July 3, 2026, Coimbra, Portugal
Abstract:Evolutionary systems have demonstrated remarkable results in creative domains, with recent applications in generative typography, design, and music. However, an open problem remains in designing fitness functions that effectively capture the desired aesthetics of abstract outputs. In this work, we explore two methods for evaluating the aesthetics of a population using Vision-Language Models (VLMs). The first method uses CLIP-IQA to predict an aesthetic score for each design. The second method instead pits candidates against each other, with winners determined by a VLM using a custom prompt specified by the user. The outcomes of these pairwise comparisons are then used to estimate a population ranking via the Glicko rating system. We present these methods in the context of a case study using a custom generative system and compare the resulting rankings with an artist’s aesthetic ranking and those produced by other aesthetic evaluation techniques. Additionally, we document the artist’s experience using these approaches to evolve designs, critically analysing the strengths and weaknesses of both methods.
[CV-307] General Covariant Action Modeling: Constructing Generalized Manifolds via Spatio-Temporal Decoupling
链接: https://arxiv.org/abs/2606.00110
作者: Huaihai Lyu,Chaofan Chen,Mingyu Cao,Yuheng Ji,Changsheng Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Achieving robust generalization from limited data is a central challenge in embodied intelligence. Prevailing methods fail by regressing absolute coordinates, which violates the principle of general covariance. Fundamentally, this conflates the intrinsic task geometry with rigid execution patterns, binding policies to specific motion styles and fixed speeds. To resolve this, we propose the Generalized Action Manifold (GAM) framework that enforces general covariance through structural disentanglement. Specifically, GAM realizes the manifold by enforcing invariance across two orthogonal dimensions: (1) Temporal Invariance, utilizing an Arc-Length Parameterizer to orthogonalize the spatial path geometry from temporal dynamics, ensuring robustness to velocity variations; (2) Geometric Invariance, where a Schema-Affine-Factorization mechanism maps trajectories to canonical ``world lines’’ in a pose-normalized coordinate frame. This distinguishes invariant geometric schemas from affine modulations, ensuring spatial generalizability. By integrating GAM within a structured Vision-Language-Action (VLA) architecture, we enable sparse demonstrations to densely populate a continuous, valid action manifold. Empirical results demonstrate that GAM enables superior transfer and robustness capabilities, outperforming geometry-agnostic baselines.
[CV-308] VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography MICCAI2026
链接: https://arxiv.org/abs/2606.00109
作者: Haoyuan Tang,Zhuo Zhang,Jialin Li,Shuai Xiao,Jiachen Yang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Early accept to MICCAI 2026
Abstract:Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01~px to 7.71~px and increasing PCK at 3~px from 52.63% to 86.27%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.
[CV-309] Visual-Noise Guided In-Context Distillation for Multimodal Large Language Model Unlearning
链接: https://arxiv.org/abs/2606.00105
作者: Junkai Chen,Yuhao He,Junxiang You,Ruiqi Liu,Chenyu Wang,Shu Wu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have achieved remarkable progress on vision-language tasks, but they may also memorize and expose sensitive or restricted knowledge, raising concerns about privacy and broader safety risks. Machine Unlearning (MU) provides a promising way to remove targeted undesirable knowledge from trained models without retraining from scratch while preserving general model utility. Nevertheless, effective unlearning in MLLMs remains particularly challenging. Existing training-based methods often struggle to balance unlearning effectiveness and model utility. In contrast, training-free methods such as in-context unlearning preserve model utility by avoiding parameter updates, but they do not remove memorized knowledge at the parameter level and may remain vulnerable to reverse-engineering attacks. More importantly, in-context unlearning is insufficient in multimodal settings, where visual inputs can provide strong conditioning signals and induce undesirable outputs. To address these challenges, we propose Visual-Noise Guided In-Context Distillation (VGID), a distillation-based framework for MLLM unlearning. VGID dynamically constructs an unlearning-oriented teacher distribution from the frozen base model through dual-modal intervention that combines visual perturbation with textual in-context unlearning. The resulting intervention-induced distribution serves as a teacher signal for distillation, guiding the student model toward parameter-level unlearning without requiring external teacher models or explicit undesirable response annotations. Experimental results show that VGID achieves strong unlearning effectiveness while preserving competitive model utility, reducing forget set ROUGE-L by 0.371 with only a 0.055 drop in retain set ROUGE-L in a representative setting.
[CV-310] CoCoVideo: The High-Quality Commercial-Model-Based Contrastive Benchmark for AI-Generated Video Detection CVPR2026
链接: https://arxiv.org/abs/2606.00101
作者: Huidong Feng,Wentao Chen,Jie Chen,Xinqi Cai,Ruolong Ma,Yinglin Zheng,Yuxin Lin,Ming Zeng
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepected by CVPR 2026
Abstract:With the rapid advancement of artificial intelligence generated content (AIGC) technologies, video forgery has become increasingly prevalent, posing new challenges to public discourse and societal security. Despite remarkable progress in existing deepfake detection methods, AIGC forgery detection remains challenging, as existing datasets mainly rely on open-source video generation models with quality far below that of commercial AIGC systems. Even datasets containing a few commercial samples often retain visible watermarks, compromising authenticity and hindering model generalization to high-fidelity AIGC videos. To address these issues, we introduce CoCoVideo-26K, a contrastive, commercial-model-based AIGC video dataset covering 13 mainstream commercial generators and providing semantically aligned real-fake video pairs. This dataset enables deeper exploration of the differences between authentic and high-quality synthetic videos and establishes a new benchmark for highly realistic video forgery detection. Building on this dataset, we propose CoCoDetect, a detection framework integrating contrastive learning with confidence-gated multimodal large language model (MLLM) inference. An R3D-18 backbone extracts spatio-temporal representations, while a confidence gate routes uncertain cases to an MLLM for reasoning about physical plausibility and scene consistency. Extensive experiments on CoCoVideo-26K and public benchmarks demonstrate state-of-the-art performance, validating the framework’s robustness and generalizability. Our code and dataset are available at this https URL.
[CV-311] CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout
链接: https://arxiv.org/abs/2606.00100
作者: Tongxi Song,Ziyu Li,Zihan Li,Wen Zhong,Congyu Liao,Yang Yang,Hua Guo,Wenchuan Wu,Qiyuan Tian
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.
[CV-312] Segmentation-Guided Spatial Indexing for Generalizable and Explainable Deepfake Detection
链接: https://arxiv.org/abs/2606.00098
作者: Izaldein Al-Zyoud,Abdulmotaleb El Saddik
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:We introduce segmentation-guided spatial indexing for generalizable and explainable deepfake detection. The key idea reverses the standard design order: rather than pooling all facial tokens and classifying afterward, we first select semantically meaningful patch tokens, then pool only those. A frozen FaRL parser assigns each DINOv3 ViT-L/16 patch token a semantic label; non-target tokens are discarded; a linear probe classifies the retained region. This spatial indexing exploits DINOv3’s patch-level spatial consistency, the same property that enables emergent segmentation, to present the probe with a purer regional subspace where manipulation-relevant evidence is less diluted by whole-face cues. Region attribution is structural: when the mouth model predicts fake, the decision used only mouth tokens, not an overlaid saliency map. On Celeb-DF v2, the mouth-indexed probe achieves AUC 0.905, outperforming LipForensics (+8.1 pp) and Xception (+16.9 pp), with no DINOv3 or FaRL fine-tuning and no target-domain data. Ablations isolate the mechanism: replacing regional selection with DINOv3’s CLS token drops Celeb-DF v2 AUC by 26.4 pp; replacing DINOv3 with FaRL features drops it by 20.9 pp. Both DINOv3 representation and the spatial index are independently necessary; neither alone approaches the full system.
[CV-313] Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents ICML2026
链接: https://arxiv.org/abs/2606.00096
作者: Dong-Hee Kim,Reuben Tan,Donghyun Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Presented in ICML 2026
Abstract:Visual agents employ external visual tools within visual chains of thought to incorporate fine-grained evidence. While prior work has mainly studied these tools in visual search tasks, their role in more complex visual reasoning remains underexplored. In this paper, we move beyond simple visual search tasks to investigate more challenging tasks, including 3D spatial reasoning and medical visual question answering, where agents must integrate tool-acquired local evidence with the global context. We identify a tool-use collapse phenomenon: models progressively stop using tools while still achieving higher task accuracy. Moreover, we observe a clear asymmetry: (i) completely eliminating tool use degrades performance, whereas (ii) incentivizing tool use yields only marginal gains despite substantially increasing usage. We find that vanilla training and tool-use encouragement both reduce rollout diversity, explaining why higher tool use does not yield stronger reasoning performance. Motivated by these findings, we add an entropy regularization term to encourage diverse rollout exploration, achieving the best performance despite gradually declining tool usage. % We further observe similar dynamics on medical VQA, suggesting that tool-use collapse is not limited to 3D spatial reasoning. Overall, our findings suggest a training-time view of tools as scaffolding, where broader exploration over language generation and visual tool invocation improves reasoning despite tool-use collapse. Project page: this https URL Comments: Presented in ICML 2026 Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00096 [cs.CV] (or arXiv:2606.00096v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.00096 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Dong-Hee Kim [view email] [v1] Mon, 25 May 2026 13:06:59 UTC (6,509 KB) Full-text links: Access Paper: View a PDF of the paper titled Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents, by Dong-Hee Kim and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CV prev | next new | recent | 2026-06 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[CV-314] Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry
链接: https://arxiv.org/abs/2606.00094
作者: Duoduo Xue,Zhiyu Zhu,Junhui Hou
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Image generative models aim to sample data points from the underlying data manifold, a task that requires learning and decoding a dense, low-dimensional, and compact parameterization space. To achieve this, we propose the Data Manifold-aware Image diffusioN moDel (MIND), a novel framework that explicitly models manifold geometry by integrating discrete patch tokenization into the score function of a continuous diffusion model. This approach successfully leverages both the structural quantification capabilities of discrete tokens and the parallel generation flexibility of continuous diffusion. Moreover, we enable end-to-end differentiable training via a novel soft top- k aggregation mechanism and introduce dual-branch high-frequency feature embedding layers to alleviate the spectral bias of transformer backbones on low-dimensional inputs. Furthermore, for inference, we design a multi-stage transition sampling scheme that dynamically adjusts the sampling scheme based on timestep. Extensive experiments on ImageNet 256 \times 256 demonstrate the effectiveness of MIND. After 80-epoch training, our base model achieves an FID of 22.73 without guidance, nearly halving the 43.47 FID of the vanilla DiT-B/2 baseline. The proposed method reduces FID by 15.95 and 9.06 on average compared with the baselines DiT and SiT, respectively. For image generation on ImageNet-256 \times 256 with guidance, the proposed MIND-B with only 130M parameters achieves an FID of 2.06, superpassing the LlamaGen-3B with 3.1B parameters. The proposed MIND-XL with 715M parameters further reduces the FID to 1.95. Our MIND introduces a fresh perspective on diffusion-based image generation, paving the way for future research and innovation in this community. The code will be publicly available.
[CV-315] Aligning Cellular Sheaves with Classifier Attention for Interpretable Weakly-Supervised Pathology Localization
链接: https://arxiv.org/abs/2606.00092
作者: Devansh Lalwani,Swapnil Bhat,Maulik Shah
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Weakly-supervised classification of whole-slide images with attention-based multiple instance learning (ABMIL) on top of foundation features now reaches near-saturation on Camelyon16 slide-level performance, but the corresponding attention maps are an imperfect localization signal: in clinical interpretation, a model that classifies correctly without firing on the actual lesion is hard to trust. We address this gap with cellular sheaves, which equip each vertex and edge of a graph with a finite-dimensional vector space and consistent linear maps between them, providing a principled way to detect local disagreement on graph-structured data. We apply cellular sheaves to weakly-supervised tumour localization on whole-slide images, combining a sheaf disagreement field with ABMIL. The natural training objective, encouraging consistency between similar features, produces a disagreement field that tracks tissue-level texture rather than diagnostic content. We propose attention-conditional consistency, which uses the classifier’s attention to define which neighbouring patches should agree. Joint training of the classifier and the sheaf under this objective produces a disagreement field with patch-level AUC 0.940 on Camelyon16 and raises the attention head from its ABMIL-alone level of 0.717 to 0.953. Two-stage ablation with the classifier frozen at its ABMIL values reaches only 0.727 on the disagreement field and leaves attention at 0.717, confirming that the gain comes from the projector co-adapting under both objectives, not from the loss change in isolation. The trained model transfers without retraining to annotated slides from Camelyon17, maintaining Delta AUC 0.932 +/- 0.083 and attention AUC 0.955 +/- 0.099. The result is an attention map and a sheaf-disagreement map that fire on the same diagnostic regions, giving clinicians two complementary explanations for each slide-level prediction.
[CV-316] Structured Visual Evidence Decomposition for Evidence-Grounded Multimodal Screening of Obstructive Sleep Apnea-Hypopnea Syndrome
链接: https://arxiv.org/abs/2606.00087
作者: Chen Zhan,Yingchen Wei,Xiaoyu Tan,Jingjing Huang,Xihe Qiu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective pre-polysomnography screening for obstructive sleep apnea-hypopnea syndrome (OSAHS) requires combining clinical risk factors with visible craniofacial and neck cues. Directly prompting general-purpose multimodal foundation models for medical yes/no decisions can yield unstable, poorly calibrated outputs. We propose EviOSAHS, an evidence-grounded multimodal reasoning framework that separates image-only anatomical evidence acquisition from final clinical adjudication. Each frontal facial image is decomposed into seven fixed anatomical queries covering the neck, chin, mouth, face/neck fat, lower jaw, midface, and nose. Visual responses are converted into structured evidence cards recording target anatomy, visibility, risk direction, evidence strength, confidence, and a concise summary. These cards are combined with a cleaned clinical profile only in the final stage, where a large language model performs balanced binary screening adjudication. We evaluated EviOSAHS on a 642-subject cohort, mapping normal subjects to screening-negative and mild, moderate, or severe OSAHS subjects to screening-positive. EviOSAHS achieved 88.47% accuracy, 94.86% sensitivity, 93.74% F1-score, and a 5.14% false-negative rate, outperforming clinical-only prompting, direct multimodal prompting, and naive two-stage pipelines under a unified protocol. Ablations showed that seven-question visual decomposition and balanced final adjudication were critical to the high-sensitivity operating point. A question-level audit of 4,494 visual outputs showed a 100% structured parse rate and 93.88% high-visibility rate. EviOSAHS provides an auditable, high-sensitivity workflow for binary pre-polysomnography OSAHS screening, but should be viewed as a triage assistant rather than a diagnostic system. Prospective validation, external testing, and calibrated operating-point control are needed before clinical deployment.
[CV-317] Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems
链接: https://arxiv.org/abs/2606.00080
作者: Alan Gerson Contreras Montanares,Luis Valenzuela,Luis Martí,Nayat Sanchez-Pi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image–text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.
[CV-318] Flow-Based Generative Modeling for Optimizing Sampling Policies in Compressed Sensing Applications
链接: https://arxiv.org/abs/2606.00078
作者: Roman Pavelkin,Luis A. Zavala-Mondragon,Christiaan G. A. Viviers,Fons van der Sommen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Numerous modern applications in signal processing and medical imaging necessitate acquiring high-dimensional signals under tight resource constraints. Traditional sampling theory suggests that accurate signal reconstruction requires a number of measurements proportional to the signal’s ambient dimension, a requirement often too expensive or impractical. Compressed sensing challenges this notion by demonstrating that sparse signals can be recovered with fewer measurements, provided the measurement operator meets certain conditions. This proof-of-concept study presents a task-aware flow-based generative framework – a reformulation of the conventional Flow Matching training paradigm with a flow model trained to optimize subsampling in compressed sensing applications. We establish the fundamental feasibility of the proposed framework of learning subsampling masks that substantially enhance the performance of compressed sensing for image classification, image reconstruction, and MRI acceleration. For the image reconstruction task, our method demonstrated state-of-the-art performance, achieving Peak Signal-to-Noise Ratio of 25.17 dB at the subsampling rate of 5% on the CelebA dataset and 29.24 dB when reconstructing 8\times accelerated MRI measurements (fastMRI dataset) with the minimal computational overhead. These results highlight the effectiveness of task-conditioning within generative flow models and reveal a promising direction for representation learning strategies. Overall, the proposed framework offers a unified, flexible approach to designing data- and task-driven sensing schemes that can be potentially adapted to a broad range of inverse problems.
[CV-319] Improved Belief-Attention in Vision Task
链接: https://arxiv.org/abs/2606.00077
作者: Guoqiang Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, Belief-Attention \citeGuoqiang25BeliefAttention has been proposed by first performing an orthogonal projection of the softmax-based weighted summation of V vectors with respect to the original V vectors and then taking the perpendicular component as the residual signal in Transformer for performance improvement. In this paper, we first conduct an ablation study showing the projected component also carries information about the token correlation, which should not be ignored. We then propose to extend Belief-Attention by making use of both the perpendicular and projected components. In particular, the projected component goes through certain activation function and then a linear mapping before merging with the considered token. Conceptually speaking, the neural block for the projected component can be viewed as a two-layer feedforward network (FFN) within the new attention block. It is also noted that standard attention captures the token correlation via the inner-product matrix QK^T . We propose to introduce an additional inner-product matrix ZZ^T to QK^T to capture richer token correlation. We refer to the new module as Belief2-Attention. It can be easily shown that Belief2-Attention is more expressive than standard Attention. We then verify the effectiveness of Belief2-Attention for vision tasks of image classification and segmentation.
[CV-320] DefocusTrackerAI – A Generalized Framework for the Automatic Detection of Defocused Particle Images
链接: https://arxiv.org/abs/2606.00076
作者: Gonçalo Coutinho,Ana S. Moita,António L. N. Moreira,Massimiliano Rossi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 10 figures
Abstract:The present work introduces DefocusTrackerAI, a generalized deep-learning framework for the automatic detection and position estimation of defocused particle images from any kind of optical configuration without compromising uncertainty and recall, intended as a follow-up of the open-source project DefocusTracker. We selected the deep neural network architecture from the direct comparison of two well-known object detection models, Faster R-CNN and YOLOv9, trained on a diverse and feature-rich synthetic image set containing astigmatic and non-astigmatic defocused particle images of varying diameters. The model evaluation on synthetic data showed that, first, YOLOv9 outperforms Faster R-CNN, achieving higher recall and lower uncertainty, particularly at high particle image densities; and second, that YOLOv9 provides enhanced spatial resolution, with uncertainty values between 0.1 and 0.4 pixels for particle image densities N_s up to 0.5, outperforming state-of-the-art algorithms. We demonstrated that our models are able to detect astigmatic and non-astigmatic defocused particle images in multiple optical setups with varying lighting conditions. In addition, we successfully applied our models on real DPT experiments, including fluorescence and shadowgraph data, showing that they can be used beyond conventional DPT applications, including the tracking of sprays and droplets. A pre-trained, ready-to-use version of DefocusTrackerAI based on YOLOv9 is available at this https URL and can be used for automatic detection of defocused particle images of any kind with high accuracy. In combination with a suitable calibration approach for the depth position, it can be used as an effective first step for three-dimensional defocusing particle tracking.
[CV-321] From Human Videos to Robot Manipulation: A Survey on Scalable Vision-Language-Action Learning with Human-Centric Data IJCAI2026
链接: https://arxiv.org/abs/2606.00054
作者: Zhiyuan Feng,Qixiu Li,Huizhi Liang,Rushuai Yang,Yichao Shen,Zhiying Du,Zhaowei Zhang,Yu Deng,Li Zhao,Hao Zhao,Zongqing Lu,Oier Mees,Marc Pollefeys,Jiaolong Yang,Baining Guo
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IJCAI 2026 Survey Track. Project page: this https URL
Abstract:Recent progress in generalizable embodied control has been driven by large-scale pretraining of Vision-Language-Action (VLA) models. However, most existing approaches rely on large collections of robot demonstrations, which are costly to obtain and tightly coupled to specific embodiments. Human videos, by contrast, are abundant and capture rich interactions, providing diverse semantic and physical cues for real-world manipulation. Yet, embodiment differences and the frequent absence of task-aligned annotations make their direct use in VLA models challenging. This survey provides a unified view of how human videos are transformed into effective knowledge for VLA models. We categorize existing approaches into four classes based on the action-related information they derive: (i) latent action representations that encode inter-frame changes; (ii) predictive world models that forecast future frames; (iii) explicit 2D supervision that extracts image-plane cues; and (iv) explicit 3D reconstruction that recovers geometry or motion. Beyond this taxonomy, we highlight three key open challenges in this area: structuring unstructured videos into training-ready episodes, grounding video-derived supervision into robot-executable actions under embodiment and viewpoint heterogeneity, and designing evaluation protocols that better predict real-world deployment performance and transfer efficiency, thereby informing future research directions. A curated list of papers and resources is available at this https URL.
[CV-322] When Jokes Cross the Line: Analyzing Regular Humor and Dark Humor in YouTube Shorts
链接: https://arxiv.org/abs/2606.00046
作者: Sydney Johns,Sanjeev Parthasarathy,Shantnu Bhalla,Vaibhav Garg
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
备注:
Abstract:Video platforms such as YouTube have reshaped how users engage with entertainment and information, emphasizing brief, highly engaging content such as Shorts. Within this ecosystem, certain content occupies a gray area where it remains allowed but may still have unintended negative effects on some audiences. To study this problem, we introduce TwistedHumor, a dataset of 1,211 YouTube Shorts paired with 33,041 related comments, with hand annotations for humor presence, humor type, harm, topic, rhetorical devices, and stand up context. Beyond dataset creation, we present a multi view analysis of how humor and harm appear in short form social media. Using LLooM based concept induction over video descriptions, we find that dark humor frequently clusters around themes of critique, coping, awkwardness, and identity expression rather than appearing as a single uniform category. We further analyze audience response through linked comments and show that regular humor is associated with more positive sentiment, while dark humor receives more mixed, neutral, and sometimes more toxic reactions. Finally, we evaluate large language models against human annotations and find that they perform better on stand up comedy compared to shorter jokes. Together, these results position TwistedHumor not only as a new benchmark, but as an empirical study of the gray area between humor and harm in short form video, highlighting the need for context aware moderation and more robust multimodal evaluation.
[CV-323] Guidance for Low-Level Perceptual Editing in Unconditional Diffusion Models CVPR2026
链接: https://arxiv.org/abs/2605.31162
作者: Shreyansh Modi,Akshat Tomar,Aarush Aggarwal
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 11 pages, 12 figures, Generative Models for Computer Vision Workshop CVPR 2026
Abstract:Unconditional diffusion models offer powerful generative priors, yet steering them toward aesthetically enhanced outputs remains largely unexplored. We show that h-space patching, the dominant paradigm for training-free diffusion editing, systematically fails for global, low-level transformations required for aesthetic and perceptual refinement. We introduce a novel, generalized framework for image-editing in unconditional diffusion models without explicit training. This inference-time mechanism operates on low-level features by extracting degradation concept vectors and combining bottleneck patching with classifier-free guidance to guide sampling away from the degraded manifold, producing consistently improved images without any model retraining.
[CV-324] Bayesian meta-learning for modeling Alzheimers disease progression
链接: https://arxiv.org/abs/2606.02228
作者: Clara Hoffmann,Nadja Klein
类目: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Predicting whether an individual with Alzheimer’s disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual’s current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual’s historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart. On real-world data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.
[CV-325] LALE: Lightweight-Transformer Architecture for Land-Cover Estimation
链接: https://arxiv.org/abs/2606.02092
作者: Ümit Mert Çağlar,Alptekin Temizel
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semantic segmentation of remote sensing imagery requires models that capture both global context and local detail under tight computational budgets. Prior work typically optimizes for one of these axes: attention for global context, convolution for local detail, or compactness for efficiency. While hybrid approaches aim to capture both, they require architectural changes and encoder backbones with computational overhead, limiting efficiency and performance. We present LALE (Lightweight-transformer Architecture for Land-cover Estimation), an end-to-end remote sensing image segmentation architecture, that bifurcates its encoder by resolution: lightweight ConvMixer stages handle high-resolution local features, while transformer stages handle low-resolution global context, confining the quadratic cost of self-attention to deep, downsampled feature maps. An all-MLP multi-scale decoder, together with RMSNorm and StarReLU throughout, further reduces compute and parameter count. On the large-scale ARAS400k remote-sensing segmentation benchmark, LALE establishes a strong efficiency-performance trade-off against CNN, transformer, and hybrid baselines. Our smallest variant, (just 1.6M parameters), reaches within 2.6 F1 points of the best baseline (UPerNet) while using 4.5x fewer parameters, 7x less storage, 17x fewer GMACs, and delivering 1.8x higher throughput.
[CV-326] Physics-Aware Linearized ADMM and Its Unrolling
链接: https://arxiv.org/abs/2606.01652
作者: Satoshi Takabe,Shunta Arai,Tadashi Wadayama
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 3 figures
Abstract:Recently, partial differential equations (PDEs) have been used to directly model the measurement process in signal processing, although their evaluation is costly. In this paper, we propose a novel alternating direction method of multipliers (ADMM)-based algorithm called physics-aware linearized ADMM (PA-LADMM) for inverse problems from PDE-based measurement processes. The key idea is the linearization of the subproblem with PDEs, leading to a cost-efficient update rule that calls only a PDE solver and its gradient evaluation per iteration. The algorithm has a theoretical convergence guarantee under certain conditions. In addition, we combine it with deep unfolding to unroll the PA-LADMM and train its internal parameters using supervised data. Two distinct experiments, compressed sensing with optical fiber communication and image restoration from noisy anisotropic diffusion, demonstrated the effectiveness of the proposed algorithms.
[CV-327] PINNOCHIO: Physics-Informed Neural Network for Coupled Hyperelastic Interface-Volume Simulation in Orthognathic Surgery MICCAI2026
链接: https://arxiv.org/abs/2606.01572
作者: Jungwook Lee,Daeseung Kim,Kevin Gu,Zhangfeng Hu,Tianshu Kuang,Finn Hopeman,Michael A.K. Liebschner,Jaime Gateno,Pingkun Yan
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to MICCAI 2026
Abstract:Predicting patient-specific facial soft-tissue deformation is critical for iterative orthognathic surgery planning. However, current computational methods face a strict accuracy-efficiency trade-off: high-fidelity Finite Element Methods (FEM) are computationally prohibitive, whereas pure deep learning models often produce biomechanically inconsistent results. While Physics-Informed Neural Networks (PINNs) offer a promising avenue, learning the complex heterogeneous mechanics of bone–soft-tissue interactions with only partial clinical supervision (i.e., outer facial surfaces) remains highly unstable. To overcome these challenges, we present PINNOCHIO, a novel physics-informed framework for facial soft-tissue simulation. PINNOCHIO introduces a hybrid sequential decomposition that explicitly decouples discontinuous bone–soft-tissue interface movements from continuous volumetric hyperelastic deformation. This structural separation enables stable training and facilitates a physics-enabled sim-to-real adaptation strategy, ensuring internal biomechanical consistency without requiring volumetric ground truth. Evaluated on a 40-patient clinical cohort, PINNOCHIO outperforms existing baselines in both surface accuracy and physical validity. Furthermore, it achieves a substantial speedup over FEM, successfully resolving the accuracy-efficiency trade-off to provide a highly reliable and practical tool for interactive surgical planning.
[CV-328] ResNet-34 with Lightweight Decoder for Accurate and Efficient Segmentation of Fetal Brain MRI
链接: https://arxiv.org/abs/2606.01293
作者: Ashiqur Rahman,Muhammad E. H. Chowdhury,Md. Abu Sayed,Md. Sharjis Ibne Wadud,Abu Naser Md. Arafat,Mehedi Hasan Prince
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate segmentation of fetal brain tissues in Magnetic Resonance Imaging (MRI) is critical for early diagnosis of congenital abnormalities and improving prenatal care. However, the task remains difficult because of fetal motion, low tissue contrast, and major anatomical variability throughout gestational ages, particularly in segmenting complex structures such as white matter, gray matter, lateral ventricles, deep gray matter, extra-cerebrospinal fluid, cerebellum, and brainstem. As a solution to these difficulties, this research introduces a novel deep learning model that combines a ResNet-34 encoder with a lightweight decoder leveraging multi-layer perceptron (MLP) modules for adaptive feature refinement. This design specifically enhances the model’s ability to preserve anatomical boundaries and mitigate segmentation errors caused by motion artifacts and intensity inhomogeneities. Computational efficiency is achieved by reducing parameter count, employing bilinear upsampling instead of transposed convolutions, and optimizing the decoder for speed without sacrificing accuracy. Trained and validated on the FeTA 2021 dataset using 5-fold cross-validation, the proposed model outperforms baseline architectures such as UNet, UNet++, DeepLabV3, and DeepLabV3+, achieving an average Accuracy of 97.37% with a mean Dice Similarity Coefficient (DSC) of 90.33%, mean Intersection over Union (IoU) of 86.93%, and Precision of 90.83%. Additionally, its fast inference time and reduced computational load make it well-suited for integration into real-time clinical workflows.
[CV-329] Differing Roles of Leisure and Productivity in GDP - A Machine Learning based comparative analysis of Germany and USA
链接: https://arxiv.org/abs/2606.01234
作者: Achintya Ranjan,Uma Ranjan
类目: General Economics (econ.GN); Computational Engineering, Finance, and Science (cs.CE); Computer Vision and Pattern Recognition (cs.CV); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
备注: International Conference on Emerging Techniques in Computational Intelligence 2025
Abstract:The GDP of a country is modelled as the relative interaction between two agents - working hours, reflecting the social choice of a population, and Total Factor Productivity, reflecting the collective investment in productivity enhancers. It is shown that a Random Forest model can accu- rately predict the GDP from these two factors. The differences in the choices made by Germany and USA are analysed though Gini importance, SHAP plots and partial dependency. It is shown that the differences in the social structure of the countries are reflected in the relative contribution of working hours and productivity to the GDP.
[CV-330] Generative Diffusion Priors for 3D Mapping of the Dark Universe CVPR2026
链接: https://arxiv.org/abs/2606.00803
作者: Brandon Zhao,Diana Scognamiglio,Olivier Doré,Katherine L. Bouman
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CVPR 2026 (Highlight)
Abstract:Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset \textttConicus3D , which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we modify a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.
[CV-331] AutoIQ: An Ensemble Framework for Automatic Assessment of Geometric Distortion in Prostate Diffusion-Weighted Imaging
链接: https://arxiv.org/abs/2606.00393
作者: Haoran Sun,Lixia Wang,Yin-Chen Hsu,Hsu-Lei Lee,Chang Gao,Fei Han,Robert Grimm,Vibhas Deshpande,Ziyang Long,Hsin-Jung Yang,Rola Saouaf,Alessandro D’Agnolo,Timothy Daskivich,Hyung Kim,Debiao Li,Yibin Xie
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: Original research; 11 pages, 7 figures, 1 table
Abstract:Geometric distortion in prostate diffusion-weighted imaging (DWI) can impair lesion localization and reduce the reliability of MRI-based clinical assessment. We propose AutoIQ, an ensemble machine learning framework for automatic quantification and classification of DWI geometric distortion severity. A total of 140 retrospective prostate biparametric MRI examinations were analyzed, including 33 scans with severe distortion requiring repeat acquisition and 107 scans with acceptable distortion based on expert radiologist assessment. AutoIQ combines two complementary distortion quantification strategies: a segmentation-based method measuring prostate boundary mismatch between T2-weighted imaging (T2WI) and DWI, and a registration-based method estimating deformation magnitude after DWI-to-T2WI alignment. The resulting distortion scores were used to train individual classifiers and a logistic-regression ensemble model. Both computational methods significantly differentiated severe from acceptable distortion cases (p 0.001). On an independent test set, the ensemble model achieved an accuracy of 0.95, F1-score of 0.93, and AUC of 0.98, outperforming individual models. These results suggest that AutoIQ can provide automated, quantitative quality assessment for prostate DWI and may help identify scans that require repeat acquisition.
[CV-332] raining-Free Continuous Bitrate Control for Scalable Image Coding for Humans and Machines
链接: https://arxiv.org/abs/2606.00158
作者: Yui Tatsumi,Hiroshi Watanabe
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continuous variable-rate compression is highly demanded in real-world applications, but remains underexplored in scalable image coding for humans and machines. In this paper, we propose a training-free variable-rate scalable image coding framework. By adjusting quantization steps based on predicted scale values, the proposed method achieves continuous bitrate control while preserving high-scale information in the machine and enhancement layers. Experimental results demonstrate the effectiveness of the proposed method and highlight the importance of bitrate allocation between the two layers.
[CV-333] Multi-Contrast MRI Motion Correction via Parameter-Informed Disentanglement and Adaptive Experts
链接: https://arxiv.org/abs/2606.00146
作者: Honglin Xiong,Yuxian Tang,Feng Li,Yulin Wang,Lei Xiang,Dinggang Shen,Qian Wang
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Motion artifacts in magnetic resonance imaging (MRI) degrade diagnostic reliability. Existing deep learning methods are typically contrast-specific and fail to generalize across diverse modalities and artifact severities. We propose a unified framework combining parameter-informed contrast disentanglement with severity-aware adaptive correction. ScanCLIP, pretrained on over 30,000 MRI text-image pairs, derives contrast embeddings from acquisition parameters to disentangle contrast style from anatomical content, yielding contrast-free features. A Vision Transformer then estimates motion severity and routes features through a Mixture-of-Experts network, enabling targeted artifact correction. A dual-pathway decoder reconstructs both the clean image and residual artifact map, enforcing image-space consistency. On IXI and HCP benchmarks, our method improves PSNR by 0.75 dB and SSIM by up to 0.0279 over state-of-the-art approaches, with larger gains at higher artifact severities. It further demonstrates robust zero-shot generalization on real-world clinical data acquired with unseen scanning parameters, where existing methods either fail to remove artifacts or introduce additional distortions.
[CV-334] ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression
链接: https://arxiv.org/abs/2606.00111
作者: Haisheng Fu,Runyu Yang,Feng Ding,Siyu Zhu,Jie Liang,Xiaoxiao Li,Zhenman Fang,Jingning Han
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 8 figures, 6 tables
Abstract:State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate reductions of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transform in CNN-transformer-based LIC schemes.
人工智能
[AI-0] Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics
链接: https://arxiv.org/abs/2606.02562
作者: Haimin Hu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)
Abstract:Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety from performance, allowing robots to operate safely around people with minimal impact on task efficiency. While traditional safety filters typically operate only in the physical space, neglecting the robot’s ability to learn and adapt online, the recently proposed belief-space safety filter (BeliefSF) reasons about robot safety in closed-loop with runtime inference that actively reduces the robot’s uncertainty online, thereby reducing conservativeness in filtering. However, providing formal safety guarantees for robots deploying BeliefSF remains a significant challenge due to errors in runtime inference and neural approximation of safety filters required to handle the high dimensionality of belief spaces. In this paper, we propose an algorithmic approach to certify high-probability safety of BeliefSF using conformal prediction, while explicitly accounting for the reliability of the robot’s runtime inference module. Our method leverages the structure of belief-space safety filtering by focusing verification on a region where inference is expected to be reliable. It preserves the simplicity and sample complexity of standard conformal prediction, yet can certify a substantially less conservative safety filter. Through a simulated human-vehicle interaction benchmark, we show that our approach verifies a significantly more permissive belief-space safety filter than a standard conformal prediction baseline.
[AI-1] racking the Behavioral Trajectories of Adapting Agents ICML2026
链接: https://arxiv.org/abs/2606.02536
作者: Jonah Leshin,Manish Shah,Ian Timmis
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure. To appear at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026
Abstract:Text files such as skill files, memory files, and behavioral configuration files play a central role in defining how modern agents act. Through edits by humans or the agents themselves, these files may evolve over time, directly steering the agent’s behavior in future interactions. We present a methodology and framework for measuring agent traits by defining traits as directions in the embedding space of a text embedding model. We train a linear model on labeled “before” versus “after” skill file diffs to learn a trait vector, then score arbitrary skill edits by projecting their embedding diffs onto this vector. Evaluated on 68 labeled skill diff pairs for the trait of propensity to seek sensitive data, our method achieves 91.2% sign classification accuracy and a Spearman rank correlation of \rho = 0.82 under leave-one-out cross-validation. We build this trait evaluation into a broader agent-to-agent protocol that enables one agent to evaluate another’s skill file updates through a trusted intermediary.
[AI-2] Bridging the Last Mile of Time Series Forecasting with LLM Agents
链接: https://arxiv.org/abs/2606.02497
作者: Yuhua Liao,Zetian Wang,Qiangqiang Nie,Zhenhua Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting has advanced rapidly, especially with the emergence of foundation models that show strong zero-shot performance on numerical extrapolation. However, in real-world forecasting settings, a statistically plausible baseline is rarely the final forecast used in practice. Before a forecast becomes decision-ready, it often needs to be revised using weakly structured business context such as holiday effects, campaign plans, external events, historical analogs, and expert feedback. This practical stage remains underexplored in the forecasting literature. In this paper, we formulate this stage as the \textbflast-mile forecasting problem and present an LLM-agent framework that sits on top of a forecasting backbone. Our system maintains a unified forecast workspace, invokes tools to retrieve contextual evidence, and converts reasoning trajectories into explicit forecast revision actions under structural safety constraints. It also supports long-horizon forecasting through map-reduce-style decomposition and post-hoc reflection through a memory bank. The resulting system is designed to be controllable and auditable. Through real-world case studies, we show how LLM agents can bridge the gap between statistical prediction and business-ready forecasting.
[AI-3] Monitoring Agent ic Systems Before Theyre Reliable ICSE
链接: https://arxiv.org/abs/2606.02494
作者: Marisa Ferrara Boston,Glen Hanson,Effi Georgala,JD Hudgens,Heather Frase
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures, 3 tables. Accepted to the Workshop on Agentic Software Engineering (AgenticSE), co-located with ACM CAIS 2026 (non-archival)
Abstract:Agentic systems entering production typically operate as partially integrated assemblies where structural defects, not task-level errors, dominate the failure landscape. At this maturity level, task-level error detection may be infeasible: structural failure modes mask the signal that task-level monitors are designed to this http URL present a monitoring and triage methodology that decomposes agentic system evaluation into three dimensions (quality, suitability, efficiency) at three monitoring scopes (within-run, cross-run, structural), using variance as a characterization signal. Findings are routed through severity classification adapted from FMEA, concentrating human attention on the subset that warrants investigation. We evaluate on a synthetic testbed of 220 runs across 120 document bundles with controlled error this http URL results emerge. Monitor scope determines failure type: within-run monitors surface deterministic stage defects (CV = 0.02), cross-run monitors surface stochastic integration consequences (CV = 1.25, 24% at L2), and a structural monitor identifies an integration gap with perfect consistency (CV = 0.00). Injected task-level errors are indistinguishable from clean baselines, confirming structural defects mask task-level signal. Deterministic triage routes 97% of findings to automated tracking, leaving the 2% reflecting variable behavior for human this http URL propose, on Stage 1 evidence, a maturity-staging model in which monitoring transitions from structural characterization to error detection to reliability tracking as integration defects resolve. The taxonomy, CV-based scope characterization, and severity model transfer architecturally to document-driven, multi-stage agentic workflows in regulated industries; specific calibrations are domain-specific. Deploy monitoring early: the first thing it finds is the most important thing to fix.
[AI-4] RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering
链接: https://arxiv.org/abs/2606.02488
作者: Yuyang Li,Zihe Yan,Tobias Käfer
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Multi-hop question-answering systems often use expensive retrieval on every question. They may decompose the question, run several retrieval rounds, or search through bridge entities before answering. All of these strategies rely on repeated LLM calls to rewrite or decompose the question, which increases extra token cost, and it is not fitting when the LLM budget is tight. However, our analysis shows that lots of multi-hop questions are already answered correctly by a single one-shot RAG, so running an extra retrieval on every question wastes the budget. We introduce RASER (Recoverability-Aware Selective Escalation Router), a family of cheap routers built on one-shot RAG and six features from it. RASER-2 decides whether to stop or escalate to the extra-retrieval action PRUNE. RASER-3 chooses among one-shot RAG, PRUNE, and iterative retrieval IRCoT, using the same features but adding an explicit cost-accuracy trade-off. Neither router makes an extra LLM call to decide. Across six LLMs and three multi-hop QA benchmarks, both routers stay competitive with the other state-of-the-art (SOTA) baselines in F1 while spending only 41-49% of always-prune’s tokens and also less than the iterative and decomposition retrieval baselines.
[AI-5] Iteris: Agent ic Research Loops for Computational Mathematics
链接: https://arxiv.org/abs/2606.02484
作者: Leheng Chen,Zihao Liu,Wanyi He,Bin Dong
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 43 pages
Abstract:Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.
[AI-6] MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation ICML2026
链接: https://arxiv.org/abs/2606.02470
作者: Wenhao Wang,Peizhi Niu,Gongyi Zou,Xiyuan Yang,Jingxing Wang,Haoting Shi,Yaxin Du,Jingyi Chai,Xianghe Pang,Shuo Tang,Yanfeng Wang,Siheng Chen
类目: Artificial Intelligence (cs.AI)
备注: ICML 2026 Camera Ready
Abstract:The Model Context Protocol (MCP) has emerged as a transformative standard for connecting large language models (LLMs) with external data sources and tools, and has been rapidly adopted across personal applications and development platforms. However, existing benchmarks predominantly focus on generic information-seeking tools and fail to capture the practical challenges posed by personal social applications, where tools interact with individual accounts or local databases. To bridge this critical gap, we introduce MCP-Persona, the first benchmark specifically designed for evaluating agent performance on real-world, personalized MCP tools. MCP-Persona encompasses a diverse set of widely-used applications, ranging from social media platforms like Reddit and Xiaohongshu (Rednote) to enterprise collaboration suites such as Lark (Feishu) and Slack. Our extensive experiments on various state-of-the-art (SOTA) agents demonstrate their significant struggles with personalized tool use, thereby highlighting the benchmark’s crucial role in identifying and addressing these limitations. MCP-Persona is publicly available at this https URLthis https URL.
[AI-7] Beyond One-shot: AI Agents for Learning in Field Experiments
链接: https://arxiv.org/abs/2606.02458
作者: Junjie Luo,Ritu Agarwal,Gordon Gao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Organizations routinely run experiments for A/B testing, yet the data generated from one experiment is underutilized to inform subsequent intervention design. Significant barriers exist to extracting actionable knowledge from prior experimental data to inform new interventions. We study whether tool-augmented agentic AI can automatically learn from experimental data to generate new interventions in subsequent experiments. Through two-stage field experiments in healthcare prescription messaging (693,139 patient visits), we compare a Human + Chatbot method (Stage 1: behavioral experts with conversational AI co-designing 13 message variants, 444,691 patient visits) against a Tool-Augmented Agentic AI method (Stage 2: AI autonomously extracting principles from Stage 1 data to generate 17 new variants, 248,448 patient visits). The Agentic AI method, equipped with analytical tools, structured Data-Information-Knowledge-Wisdom (DIKW) reasoning agents, and transparent evidence chains, produces superior interventions: the best AI-generated message achieved a 69.8% CTR (+6.5 percentage points over baseline). Critically, our results suggest that the value comes from domain-specific experimental data, not from general reasoning ability: frontier LLMs operating without experimental data failed to predict which interventions would succeed. The field experiments also revealed that general-purpose behavioral theories used for intervention design do not extend uniformly to specific healthcare contexts, motivating an agentic AI approach to theory audits at field-experiment scale. Our research shows that tool-augmented AI can learn from experimental data and generate improved domain-relevant interventions, transforming behavioral experimentation from one-shot evaluation into a scalable system for cumulative design learning.
[AI-8] LLM -Evolved Pattern Generators for Optimal Classical Planning
链接: https://arxiv.org/abs/2606.02438
作者: Windy Phung,Dominik Drexler,Arnaud Lequen,Jendrik Seipp
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Existing approaches, however, focus on improving search guidance rather than guaranteeing admissibility, which makes them unsuitable for optimal classical planning. We present the first method for learning domain-dependent heuristics that are admissible by design and thus preserve the optimality guarantees of A* search. Instead of learning a direct mapping from states to heuristic values, we learn to construct abstractions that induce admissible heuristics. We use an LLM-driven evolutionary program-synthesis framework to obtain, for each domain, a program that produces a pattern collection for any task in that domain, and we combine the resulting patterns admissibly via saturated cost partitioning. Empirically, the learned programs encode interpretable domain-specific insights, run with negligible overhead at test time and yield heuristics that match the coverage of state-of-the-art domain-independent baselines on several domains while evaluating each state substantially faster.
[AI-9] Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization
链接: https://arxiv.org/abs/2606.02434
作者: Yusuke Ohtsubo,Kota Dohi,Koichiro Yawata,Koki Takeshita,Tatsuya Sasaki
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Precise parametric control over circuit geometry is essential for semiconductor inspection, yet obtaining sufficient real training data remains costly. Although generative models such as diffusion models and Generative Adversarial Networks (GANs) can augment training data, they cannot guarantee the nanometer-scale geometric accuracy required for metrology tasks. We propose a visual program synthesis framework in which a Vision-Language Model (VLM) converts inspection images into editable Domain-Specific Language (DSL) code describing circuit geometries, enabling controlled generation of training data with exact parameter manipulation. Because the VLM is trained solely on synthetic DSL-rendered data, a domain gap arises when processing real Scanning Electron Microscope (SEM) images. We bridge this gap with an input binarization strategy that strips SEM-specific texture and noise, letting the model focus on geometric structure. On the MIIC dataset, binarized inputs improve the mean Dice coefficient from 0.4393 to 0.5256 over the raw-input baseline, demonstrating that simple texture abstraction substantially mitigates the sim-to-real gap.
[AI-10] Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference
链接: https://arxiv.org/abs/2606.02430
作者: Yafan Huang,Sheng Di,Guanpeng Li
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注: Accepted at ICS’26
Abstract:Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, accelerating scientific discovery through diverse perspectives such as code generation and domain-specific decision-making. Yet, how soft errors propagate and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study on error propagation in LLM inference, enabled by our proposed LLMFI, a configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults across three open-weighted LLMs and thirteen representative tasks, covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions to improve reliability through software-only modification, offering practical guidance for future error detection and mitigation.
[AI-11] Policy and World Modeling Co-Training for Language Agents
链接: https://arxiv.org/abs/2606.02388
作者: Ning Lu,Baijiong Lin,Shengcai Liu,Jiahao Wu,Haoze Lv,Yanbin Wei,Lingting Zhu,Shengju Qian,Xin Wang,Ying-Cong Chen,Qi Wang,Ke Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, 6 figures
Abstract:Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.
[AI-12] Agent PLM: Agent ic Protein Language Models with Reasoning -Augmented Decoding for Protein Sequence Design
链接: https://arxiv.org/abs/2606.02386
作者: Sahil Rahman,Maxx Richard Rahman
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Protein language models (PLMs) are passive oracles: they generate sequences in a single forward pass with no mechanism to consult external biophysical feedback or redirect generation when a candidate violates thermodynamic or structural constraints. We introduce AgentPLM, which addresses this by equipping a pre-trained PLM with i) Reasoning-Augmented Decoding (RAD), which interleaves autoregressive generation with tool calls (ESMFold, FoldX, AutoDock Vina), and ii) Contrastive Agent Policy Optimisation (CAPO), a trajectory-level extension of direct preference optimisation that trains the policy end-to-end to learn when oracle feedback is informative rather than merely imitating high-fitness sequences. We evaluate AgentPLM on benchmark tasks spanning de novo enzyme design, antibody optimisation, thermostability, PPI interface design, and zero-shot fitness prediction with standardised oracle APIs and controlled sequence-identity splits. AgentPLM achieves state-of-the-art results with a gain in antibody top-10% hit rate over the strongest passive baseline, providing mechanistic evidence of online error correction without explicit backtracking.
[AI-13] A Mathematical Conflict Framework for Contextual Data Modulation
链接: https://arxiv.org/abs/2606.02381
作者: Hakan Emre Kartal
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Dynamical Systems (math.DS)
备注: 15 pages, 3 figures, framework paper
Abstract:In this study, a generalized operator-based mathematical conflict framework is presented to explicitly represent structural discrepancies between raw data and contextual data. The proposed structure treats conflict as a local, directional, and context-sensitive quantity, integrating components such as weighting, scale behavior, and output mapping under a unified abstract operator. Without being reduced to a specific learning algorithm or optimization method, the framework is defined as a general structure adaptable to different classes of problems. While existing approaches typically treat conflict merely as an implicit side effect embedded within the optimization process, the proposed framework considers conflict as an independent, operator-based, and component-level mathematical object.
[AI-14] When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-ClassArchitectures
链接: https://arxiv.org/abs/2606.02378
作者: Yongzhong Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 22 pages, 2 figures
Abstract:We track the developmental trajectory of attention-head circuit formation across three 1B-class language models spanning two architecture families (dense transformer, mixture-of-experts) and two pretraining corpora (The Pile, DCLM): Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. At each of 10 log-spaced revisions per model – 30 mechanistic-interpretability runs in total – we apply a participation-ratio (PR) spectral signal and an all-head capability-specific selectivity screen to track induction, previous-token, and BOS-attractor heads as they emerge. Five findings. (F1) Layers 0 and 1 produce zero BOS-classified heads at every revision in every model: the L0/L1 zero-BOS floor is an architectural property, not a learned outcome. (F2) The whole-model BOS-attractor fraction follows three distinct emergence shapes – a gradual ramp in Pythia 1B, a sharp phase transition in OLMo 1B (7% to 70% between adjacent checkpoints), and a gradual ramp in OLMoE 1B-7B. (F3) In DCLM models, induction-circuit formation precedes BOS-attractor formation by 10-20x in tokens; capability-circuit formation and attention-sink formation are two transitions, not one. (F4) The capability-specific screen converges to the final induction circuit within 0.3-2% of total training tokens – circuit identification does not require the final model. (F5) For every final-checkpoint induction head sampled across all three models, per-head PR is elevated at or before the first revision at which that head crosses its capability-selectivity threshold. The results refine the induction-phase-transition framing: in 1B-class models trained on DCLM, the induction transition and the attention-sink transition are separated by an order of magnitude in tokens and have qualitatively different shapes. Comments: 22 pages, 2 figures Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.02378 [cs.LG] (or arXiv:2606.02378v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.02378 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-15] Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models
链接: https://arxiv.org/abs/2606.02374
作者: Steffen Knoblauch,Hao Li,Gengchen Mai,Konstantin Klemmer,Song Gao,WenWen Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Earth Observation (EO) has fundamentally transformed the monitoring of environmental processes and human activities up to planetary scale. Recent advances in self-supervised learning have given rise to Earth Observation Foundation Models (EOFMs), which leverage petabyte-scale unlabeled EO data to learn transferable representations across a wide range of downstream geospatial tasks. Despite these advances, current EOFMs remain largely confined to raster modalities, overlooking the rich, structured information encoded in openly-accessible vector data sources such as OpenStreetMap and Overture. Vector data provides explicit and compact representations of geographic entities, including geometry, topology, and semantic relationships, offering critical contextual signals that are often ambiguous or inaccessible in imagery alone. Raster and vector data thus represent complementary views of geographic space: raster data captures continuous physical and spectral patterns, while vector data encodes discrete objects and their relational structure and often represents more of the human rather than the physical systems (e.g. social or demographic data). However, existing geospatial representation learning paradigms treat these modalities in isolation, relying on imperfect and often lossy transformations to bridge them. This perspective paper calls for a paradigm shift toward joint Spatial Representation Learning (SRL) in an unified embedding space that integrate raster perception with vector-based reasoning. Building on emerging efforts in multimodal geospatial learning, we highlight conceptual foundations, technical challenges, and promising directions for aligning heterogeneous spatial data sources. We contend that such integration is essential for developing next-generation geospatial AI systems capable of more accurate, interpretable, and semantically grounded understanding of the Earth.
[AI-16] FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo ICML2026
链接: https://arxiv.org/abs/2606.02365
作者: Kyunghun Nam,Sumyeong Ahn
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages, ICML 2026 camera-ready version
Abstract:Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.
[AI-17] MOC: Multi-Order Communication in LLM -based Multi-Agent Systems
链接: https://arxiv.org/abs/2606.02359
作者: Yao Guan,Lin Wang,Zhihu Lu,Ziyi Wang,Wenzhu Yan,Qiang Duan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at this https URL.
[AI-18] SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
链接: https://arxiv.org/abs/2606.02355
作者: Zhongyu He,Yuanfan Li,Fei Huang,Tianyu Chen,Siyuan Chen,Xingyang Li,Meng Hsuan Yu,Xiangrong Liu,Leyi Wei,Lu Pan,Ke Zeng,Xunliang Cai
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation with closed-source large model. Our code is available at this https URL.
[AI-19] Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
链接: https://arxiv.org/abs/2606.02337
作者: Santiago Amaya-Corredor,Miguel Calvo-Fullana,Anders Jonsson
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the Reinforcement Learning Conference (RLC) 2026. 40 pages (12 main + 28 appendix), 5 figures, 16 tables, 7 theorems
Abstract:Constrained Multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective–constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.
[AI-20] Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions
链接: https://arxiv.org/abs/2606.02326
作者: Yifan Wang
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures
Abstract:Hard constraints are usually treated as terminal vetoes: once a candidate violates a requirement, the learned rule rejects it and any repair is handled outside the decision semantics. This misses a common deployed regime in which the system already knows a finite menu of modifications, such as adding a ticket option, changing a configuration, or requesting an available service upgrade. Existing constraint-learning, soft-relaxation, and recourse methods address nearby problems, but they do not learn whether an option should be repaired before being vetoed. We introduce Repair-Augmented Constraint Learning (RACL), a contextual decision framework that lifts known repair operators into the classifier semantics. A candidate is accepted when an affordable repair makes it feasible and preferred enough; otherwise the system returns a structured rejection credit and, when applicable, a repair plan. This repair-before-veto view strictly generalizes no-repair HASSLE-style semantics, reveals an irreducible false-veto gap for terminal-veto rules, separates binary-label non-identifiability from decision-rule learnability, and gives capacity and calibration bounds for the observed-feasibility shared-weight setting. Across controlled and DB1B-derived benchmarks, RACL recovers the intended credit and repair structure. On the hardest raw-data-derived tier, validation-selected RACL reduces false vetoes to 10/4039 (FVR 0.0025), versus about 1064/4039 for the strongest repair-search black-box baseline, while making the FVR/EDR trade-off explicit.
[AI-21] Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment
链接: https://arxiv.org/abs/2606.02322
作者: Ran Liu,Min Yu,Mingqi Liu,Jianguo Jiang,Gang Li,Rongsheng Li,Ning Li,Zhen Xu,Weiqing Huang,Ming Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, limited transfer, and vulnerability to adversarial perturbations. To address this, we present AdvCL, which repurposes adversarial perturbations as a geometric control signal for stable continual adaptation. AdvCL combines three plug-in modules: Intra-Smooth promotes local smoothness via small adversarial perturbations; Proto-Clip uses similarity clipping to prevent excessive alignment to current task prototype; and Inter-Align applies directional alignment toward previous task prototype to reduce representational gaps. Experiments show consistent gains in both standard performance and robustness, with lower forgetting and stronger transfer. We further analyze key mechanisms by quantifying the sensitivity of Intra-Smooth to perturbation settings and the effect of Inter-Align on task similarity and geometric distance. In summary, the modules provide complementary gains when combined, and each can also be integrated individually into diverse CL paradigms, including replay, regularization, and dynamic architectures, thereby offering a geometric control mechanism for continual learning.
[AI-22] SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents
链接: https://arxiv.org/abs/2606.02302
作者: Hao Cheng,Changtao Miao,Tianle Song,Yin Wu,He Liu,Erjia Xiao,Junchi Chen,Xiaoyu Shi,Yichi Wang,Jing Yang,Taowen Wang,Jinhao Duan,Mengshu Sun,Peiyan Dong,Xuan Shen,Yang Cao,Renjing Xu,Kaidi Xu,Jindong Gu,Bo Zhang,Jize Zhang,Chenhao Lin,Philip Torr,Chao Shen
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous LLM agents increasingly operate in stateful environments where they access tools, files, memory, and external services. While such capabilities enable complex real-world workflows, they also introduce security risks that are difficult to capture with existing evaluations. Current agent security benchmarks often rely on manually curated tasks, provide limited coverage of emerging threats, and focus primarily on final outcomes rather than the execution processes that lead to unsafe behavior. We introduce SeClaw, a framework that combines specification-driven security task synthesis with execution-based security evaluation for Autonomous agents. Spec-driven security task synthesis enables scalable and controllable construction of security tasks from structured risk specifications, while SeClaw docker provides a standardized testbed for evaluating agent behavior under diverse safety-risk scenarios. The benchmark covers risks arising from resources, user tasks, environments, and intrinsic agent behaviors, and supports trajectory-aware assessment of unsafe actions beyond final responses. By bridging systematic task synthesis and reproducible security evaluation, SeClaw provides a practical foundation for measuring, diagnosing, and comparing security failures in autonomous LLM agents. The code is available at this https URL.
[AI-23] CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation
链接: https://arxiv.org/abs/2606.02287
作者: Shibo Zhu,Xiaodan Shi,Dayin Chen,Yuntian Chen,Haoran Zhang,Tianhao Wu,Jinyue Yan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systematic comparison across trajectory generation methods remains difficult because existing studies often rely on different datasets, preprocessing pipelines, trajectory representations, and evaluation metrics. This fragmentation makes it unclear whether reported performance differences arise from the generation mechanism itself or from inconsistent experimental protocols. To address this issue, we present CityTrajBench, a unified benchmark framework and protocol for city-scale vehicle trajectory generation. CityTrajBench standardizes data ingestion, trajectory normalization, feature construction, model adaptation, map-aware post-processing, model selection, and multi-level evaluation under a common setting. It supports heterogeneous generators, including statistical baselines, VAE-based, GAN-based, diffusion-based, and flow-matching-based models, and evaluates them on three real-world urban trajectory datasets. The benchmark measures global spatial realism, trip-level distribution fidelity, trajectory-level geometric similarity, conditional mobility consistency, and efficiency. Experiments reveal clear trade-offs across model families: DiffTraj is strongest on trajectory-level geometric fidelity, DiffRNTraj is competitive on structure-sensitive global realism, and TrajFlow provides a strong balance across realism, quality, conditional consistency, and efficiency. Meanwhile, a simple Markov baseline remains competitive on coarse-grained trip and local-movement statistics. These findings show that urban trajectory generation quality is inherently multi-objective, that no single model dominates all criteria equally, and that CityTrajBench provides a reproducible benchmark protocol and testbed for future research on urban mobility generation.
[AI-24] POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
链接: https://arxiv.org/abs/2606.02282
作者: Iñaki Dellibarda Varela,R. Sendra-Arranz,Pablo Romero-Sorozabal,J.M. Valverde-García,Annemarie F. Laudanski,Álvaro Gutiérrez,Eduardo Rocon,Manuel Cebrian
类目: Artificial Intelligence (cs.AI)
备注: 44 pages, 6 figures
Abstract:Orchestrating Large Language Models into Multi-Agent Systems (LLM-MAS) has unlocked remarkable reasoning capabilities, yet emergent failures and hallucinations that resist characterisation block their deployment in safety-critical domains – a gap made legally untenable by emerging AI regulation. Existing evaluation paradigms share a common flaw: centralised judgment creates single points of failure and demands domain-specific expertise. Here we present POIROT, a protocol that repurposes a system’s own agents as its diagnostic layer, leveraging the epistemic diversity already present in the architecture. Across evaluated settings, POIROT outperforms single-LLM evaluator baselines, with gains that scale with problem complexity (OR = 1.60, p = 0.008 ), agent count, and fault dimensionality, persisting under compound fault conditions. These results demonstrate that safety oversight need not be externalised: the agents executing a role carry sufficient collective intelligence to audit it. We release POIROT as an open-source library alongside BLAME, a benchmark for fault attribution in safety-critical multi-agent systems.
[AI-25] CEON: Circular Economy Ontology Network
链接: https://arxiv.org/abs/2606.02253
作者: Huanyu Li,Els de Vleeschauwer,Robin Keskisärkkä,Mikael Lindecrantz,Mina Abd Nikooie Pour,Ying Li,Ben De Meester,Patrick Lambrix,Eva Blomqvist
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Increasing the circularity of resource use in our society has been recognized as a path to sustainability, i.e., transitioning into a more circular economy. There are many different circular strategies to do so, such as reusing products and components, refurbishing and remanufacturing used products, or recycling left-over or used materials. To enable these strategies, it is necessary to share information at the infrastructure level and to communicate between industry sectors along the product life cycle. Enabling semantic interoperability in this information sharing and communication is therefore a key to increasing circularity. However, knowledge representation for the circular economy (CE) domain, which involves many relevant industry sectors related to product life cycles, remains challenging. To bridge this gap, we developed the Circular Economy Ontology Network (CEON) within the Onto-DESIDE project. This ontology network aims to fill gaps in CE by defining cross-sectorial concepts and to enable semantics-aware data documentation. We demonstrate CEON through cross-industry data documentation scenarios spanning construction, electronics, and textile sectors.
[AI-26] FW-NKF: Frequency-Weighted Neural Kalman Filters ICRA2026
链接: https://arxiv.org/abs/2606.02251
作者: Adnan Harun Dogan,Berken Utku Demirel,Christian Holz
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: Published at ICRA 2026
Abstract:Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation, and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.
[AI-27] Faster Synchronous On-Policy RL via Strag gler-Aware Group Sizing
链接: https://arxiv.org/abs/2606.02218
作者: Azal Ahmad Khan,Ammar Ahmed,Zeshan Fayyaz,Sheng Di,Mingyi Hong,Ali Anwar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers, a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
[AI-28] On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching ICML
链接: https://arxiv.org/abs/2606.02179
作者: Mohammad Rashed,Duarte F. Valoroso Madeira,Babak Gholami,Caglar Guerbuez,Yunjia Yang,Nils Thuerey
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: ICML Paper
Abstract:Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is governed by how much information the conditioning signal preserves about the adjoint sensitivity (reduced gradient) that drives classical TO. Modeling the TO pipeline as a causal Markov chain, the Data Processing Inequality establishes that, under this abstraction, the sensitivity field is an information-theoretically optimal conditioning signal for topology prediction. However, computing exact adjoint sensitivities can be expensive or unavailable in practice; we observe that certain physical fields can approximate sensitivities through monotone transformations. To formalize this, we introduce \textbfpseudo-sensitivities to characterize which fields enable generalization versus those that are information-poor. We then show that a sensitivity-conditioned Bernoulli flow-matching generator empirically confirms these predictions: conditioning on sensitivities yields state-of-the-art OOD performance, while increasingly distant physical fields degrade toward raw parameter conditioning. Results hold across structural TO benchmarks under load shifts and our new CFD-TO dataset under boundary-condition shifts such as multi-outlet configurations. Code and datasets are available at this https URL .
[AI-29] From Capability Models to Automated Planning : An AAS-Native Approach for Automatic PDDL Generation
链接: https://arxiv.org/abs/2606.02167
作者: Hamied Nabizada,Thomas Wirt,Luis Miguel Vieira da Silva,Felix Gehlhoff,Alexander Fay
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)
Abstract:Engineers designing production systems need to verify that a given layout supports all required production sequences. Automated planning techniques can answer such questions, but formulating the required planning problems in the Planning Domain Definition Language (PDDL) demands specialized expertise that production engineers typically lack. Asset Administration Shells (AAS) have emerged as the standardized Digital Twin for industrial assets in Industry 4.0. We show that AAS capability models, structured using four established Industry 4.0 standards (VDI 3682 for process descriptions, IEC 61360-1 for semantic property qualification, IDTA 02011 for type hierarchies, and IDTA 02016 for instance descriptions), contain sufficient information to generate complete PDDL problems automatically. Unlike prior work that introduced PDDL-specific submodels, our approach derives all planning elements from domain-level descriptions of resource functions, so-called capabilities, allowing engineers to model capabilities without any exposure to PDDL syntax or planning concepts. Our extraction algorithm transforms distributed Multi-AAS architectures into complete PDDL planning problems. We validate the approach on AAS models of a laboratory production system, comparing four layout variants using optimal planning to demonstrate how engineers can systematically explore design trade-offs by modifying the AAS model and regenerating the planning domain
[AI-30] An Abstract Worlds Semantic Framework for Belief Change Operators
链接: https://arxiv.org/abs/2606.02163
作者: Daniel Grimaldi,M. Vanina Martinez,Ricardo O. Rodriguez
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This article proposes a set-theoretic framework for belief change, called Abstract Worlds Semantics, in which no logical syntax is assumed. Inspired by Grove’s (1988) results, our approach treats worlds as primitive elements, over which world contraction and world revision operators are defined. This semantic framework enables a unified analysis of belief change models. Within this framework, we unify classical and non-prioritized belief change constructions by defining versatile operators. When classical propositional logic is considered, our framework provides a homogeneous account of AGM, KM, and Multiple Change models. In summary, AWS systematizes belief change frameworks and operators, simplifying and generalizing belief change theory over belief sets.
[AI-31] S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty
链接: https://arxiv.org/abs/2606.02151
作者: Fabio Pavirani,Bert Claessens,Pierre Pinson,Chris Develder
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:Effective scheduling in the energy sector is essential to ensure the reliable operation of electrical grids and their connected assets by, for instance, optimizing the dispatch of generation units and storage systems. An effective planning strategy must (a) accommodate advanced and potentially non-linear system models – exploiting the increasing data availability of modern grids, and (b) explicitly handle uncertainties arising, for instance, from the integration of renewable energy sources. While existing approaches can address either non-linearity (e.g., Monte Carlo Tree Search) or uncertainty (e.g., stochastic mathematical optimization), there is a lack of planning techniques capable of addressing both challenges simultaneously. To bridge this gap, we propose a Stochastic Scenario-Structured Tree Search (S3TS) algorithm that explicitly represents uncertainty through scenario trees while enabling the integration of advanced non-linear models. We evaluate S3TS on a simulated demand response signal publication problem, largely mimicking the imbalance settlement mechanism in Belgium. The results demonstrate near-optimal performance in linear, analytically tractable settings, with costs within 14% of the mathematically optimal solution conditioned to the scenario trees. In highly non-linear scenarios, S3TS significantly outperforms baseline methods, achieving cost reductions of up to 51% and 5.4% compared to a myopic algorithm and deterministic MCTS, respectively.
[AI-32] VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting
链接: https://arxiv.org/abs/2606.02138
作者: Xudong Zhang,Jierui Lei,Jiacheng Li,Lingdong Shen,Jian Cui,Haina Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Out of distribution (OOD) events in multivariate time series forecasting are rare but often dominate real world risk, making average case forecasting insufficient for reliable deployment. Under standard average risk training on mixed ID/OOD distributions, optimization signals from rare OOD events can be overwhelmed by frequent in distribution (ID) patterns, so strong benchmark accuracy may not translate into reliability under high impact shifts. To address this issue, we propose VLBM (Variational Latent Basis Model), a theory guided latent forecasting framework that separates stable dynamics from OOD induced deviations. VLBM learns a shared latent basis that defines a low rank subspace for stable ID dynamics, explicitly decomposes inputs into basis subspace components and orthogonal residual components, and aligns a future aware posterior with a future blind prior so that test time latent inference depends only on historical input. Across 12 benchmark tasks spanning transportation, weather, power systems, and other real world domains, including newly constructed real world OOD traffic datasets, VLBM achieves state of the art OOD robustness and ID accuracy, with average MAE and MSE gains of 15.08% and 7.74% over the strongest baseline. On a synthetic simulation dataset, VLBM also consistently achieves the best performance and better tracks OOD pulse recovery. These results support latent structured forecasting as a principled route to robust prediction under mixed ID and OOD conditions. The code is available at this https URL.
[AI-33] Variational Learning for Insertion-based Generation
链接: https://arxiv.org/abs/2606.02133
作者: Yangtian Zhang,Zhe Wang,Arthur Gretton,Rex Ying,David van Dijk,Michalis K. Titsias,Jiaxin Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
[AI-34] Learning When Not to Act: Mitigating Tool Abuse in Agent ic Reinforcement Learning
链接: https://arxiv.org/abs/2606.02132
作者: Liuji Chen,Dianxing Tang,Xing Shi,Dingshuo Chen,Qiang Liu,Shu Wu,Liang Wang
类目: Artificial Intelligence (cs.AI)
备注: Under reivew
Abstract:Agentic reinforcement learning can induce tool abuse, where models overuse external tools even for queries solvable by internal reasoning. Existing approaches mitigate this issue with uniform tool-use penalties or hard limits, which reduce tool frequency but may also suppress useful tool-assisted exploration. We propose EAPO, an Efficient Agentic Policy Optimization framework that learns selective tool use. EAPO introduces tool-free trajectories into each rollout group, applies difficulty-aware reward shaping to penalize redundant tool calls mainly on easier queries, and uses confidence-aware token reweighting to improve policy learning. Across nine mathematical and knowledge-intensive reasoning benchmarks, EAPO consistently improves the accuracy efficiency trade-off on Qwen2.5-3B, Qwen2.5-7B, and Llama3.1-8B. Compared with GRPO, EAPO improves average performance by 10.45%, 7.27%, and 9.69%, while reducing average tool calls by 18.33%, 18.33%, and 24.59%, respectively. These results show that agents can learn when not to use tools without compromising tool-integrated reasoning.
[AI-35] How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning ICML2026
链接: https://arxiv.org/abs/2606.02119
作者: Jiangwei Chen,Xinyuan Niu,Rachael Hwee Ling Sim,Zhengyuan Liu,Nancy F. Chen,Bryan Kian Hsiang Low
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaining the model performance on the remaining retain data. Existing unlearning algorithms, such as optimizing a weighted combination of losses, have tried to achieve these objectives of improving forget quality and maintaining retain utility. However, they do not guarantee that these objectives can be improved by a specified extent for all forget and retain data. In this work, we address this limitation with a novel and theoretically-grounded approach from a constrained optimization perspective. Firstly, we identify that the hardness of reconciling both objectives can be quantified by the similarity between the forget data and the retain data. Next, we derive an unlearning algorithm (HAMU) with the overall goal of guaranteeing a specified improvement in forget quality while minimizing the retain utility cost/degradation by updating the model weights based on our hardness measure. Our hardness measure also informs users when retain utility degradation is unavoidable, i.e., both objectives cannot be improved simultaneously, and stopping should be considered. Our algorithm is applicable to non-convex models and is easily parallelizable, making it readily deployable in real-world scenarios. We empirically demonstrate HAMU’s superior performance over baselines on both image and text datasets using large models. Our code is available at this https URL.
[AI-36] BADGER: Bridging Agent ic and Deterministic Evaluation for Generative Enterprise Reasoning
链接: https://arxiv.org/abs/2606.02109
作者: Shannon Serrao,Soumitra Chatterjee,Dorina Strori,Abhishek Sharma,Nathan Miller
类目: Artificial Intelligence (cs.AI)
备注: 30 pages, 2 figures, 6 tables
Abstract:Enterprise AI systems that translate natural language into SQL queries and orchestrate multi-step agentic reasoning pipelines require evaluation approaches fundamentally different from academic benchmarks. Spider and BIRD established execution-accuracy protocols; G-Eval and RAGAS advanced LLM-based assessment; and recent work such as Spider 2.0, BEAVER, and BIRD-Interact has begun to address enterprise and agentic dimensions. No single framework unifies text-to-SQL assessment with agentic behavior evaluation into a production-grade pipeline calibrated against human expert judgment. We present BADGER, developed at Merkle, a unified evaluation framework integrating text-to-SQL assessment with agentic behavior evaluation. BADGER offers three contributions. First, LLM-assisted SQL component extraction extending Spider methodology to handle CTE-heavy, dialect-specific SQL. Second, a hybrid execution accuracy metric (Hybrid-EX) resolving column-aliasing and numeric-tolerance brittleness by using an LLM to infer structural alignments before deterministic cell-level scoring. Validated on 150 human-annotated industry queries, Hybrid-EX achieves Cohen’s kappa=0.717 [95% CI: 0.600-0.822] (Substantial agreement) and 87.3% balanced accuracy, outperforming all six competing frameworks (Delta-kappa: 0.322-0.502, all p=0.001). Third, an enterprise agentic evaluation suite assembling RAGAS, G-Eval, and agent benchmark metrics into a unified pipeline; Excess Tool Usage is the sole novel element. BADGER runs entirely within the client’s governed data environment, supports configurable LLM judge backends, and enables rapid prototyping of client-specific judges and metrics, serving as a continuous evaluation backbone rather than a one-time quality gate. Comments: 30 pages, 2 figures, 6 tables Subjects: Artificial Intelligence (cs.AI) ACMclasses: I.2.7; H.2.3; H.3.3 Cite as: arXiv:2606.02109 [cs.AI] (or arXiv:2606.02109v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.02109 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-37] Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters
链接: https://arxiv.org/abs/2606.02107
作者: Youssef Mahran,Zeyad Gamal,Aamir Ahmad,Ayman El-Badawy
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore
Abstract:This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional multi-agent MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information of only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability, as policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence with increasing steady-state spread at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.
[AI-38] Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
链接: https://arxiv.org/abs/2606.02060
作者: Jiaming Wang,Ziteng Feng,Jiangtao Wu,Ruihao Li,Qianqian Xie,Yuxiang Ren,He Zhu,Xueming Han,Fanyu Meng,Junlan Feng,Jiaheng Liu
类目: Artificial Intelligence (cs.AI)
备注: 28 pages, 11 figures, 4 tables
Abstract:Deep-research agents solve tasks through long trajectories of search, tool use, evidence inspection, and answer synthesis. Evaluation based on final answers shows whether an agent succeeds, but not which parts of the trajectory make the answer unreliable. We study span-level error localization for deep-research agents. We collect 2,790 real trajectories from two agent frameworks, three backbone models, and three benchmarks, convert raw logs into semantic spans, and annotate harmful error spans through LLM-assisted expert review. From these annotations, we build TELBench, a 1,000-instance benchmark for identifying error spans among normal exploration, failed searches, tentative hypotheses, and harmless noise. We further propose DRIFT, a claim-centric auditing framework that tracks agent claims, checks their support in trajectory evidence, and marks spans where unsupported or conflicting claims affect the answer path. Experiments across model families and auditing frameworks show that DRIFT improves span-level error localization and first-error accuracy by up to 30 percentage points. Our work provides a process-level view of reliability in deep-research agents.
[AI-39] MoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion
链接: https://arxiv.org/abs/2606.02054
作者: Xiang Li,Jiwei Wei,Ke Liu,Yitong Qin,Jinyu Guo,Malu Zhang,Peng Wang,Yang Yang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) achieve impressive performance on multi-step reasoning tasks, their reliability is persistently hindered by critical limitations such as unconstrained hallucinations and poor numerical computation. Fundamentally, these issues arise because standard models treat reasoning as a transient, one-off generation process rather than retaining and refining successful procedural logic. To address these challenges, we propose eMoT (evolving Memory-of-Thought), a unified framework that stabilizes multi-step reasoning by treating reasoning trajectories as dynamic, evolving memories rather than static templates. The framework primarily consists of three interconnected modules: (i) a memory corrosion mechanism that reinforces high-utility reasoning structures while gradually decaying less frequent ones; (ii) a symbolic anchoring engine that utilizes Python for deterministic computation, much like a human uses a calculator; and (iii) a consistency-driven refinement process that aligns neural inference with symbolic outcomes, reducing the accumulation of logical discrepancies. Across multiple reasoning benchmarks, eMoT improves accuracy and solution consistency over standard Chain-of-Thought and structured reasoning this http URL the traditional task Game of 24, eMoT achieves 100% accuracy, surpassing the baseline by up to 17.6%. Evaluations on mathematical task GSM8K, ASDiv, SVAMP, and MGSM further show consistent gains in multi-step mathematical reasoning. In our evaluation, we achieve superior performance despite utilizing a lightweight backbone model with constrained baseline capabilities. Compared to alternative methods that rely on massively scaled models, our results demonstrate that the performance gains are fundamentally driven by the eMoT framework’s reasoning control rather than sheer model size.
[AI-40] Explainable Data-driven Deep Reinforcement Learning Methods for Optimal Energy Management in Buildings
链接: https://arxiv.org/abs/2606.02049
作者: Hallah Shahid Butt,Qiong Huang,Gökhan Demirel,Kevin Förderer,Erfan Tajalli-Ardekani,Simnon Waczowicz,Luigi Spatafora,Veit Hagenmeyer,Benjamin Schäfer
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing integration of renewable energy sources into power systems, particularly in buildings equipped with photovoltaic (PV) panels and energy storage systems, introduces significant complexity in energy systems. Volatile power generation, varying electricity tariffs, and increased entities, e.g., PV systems, and heat pumps, have increased the complexity and made the system harder to operate. This leads to the demand for additional control and optimization routes including data-based controls, such as reinforcement learning. While deep reinforcement learning (DRL) has emerged as a promising solution to optimize building operations in dynamic and ever more complex environments, its black-box nature impedes user trust and practical adoption. This paper presents a framework for explainable deep reinforcement learning (XRL) applied to energy management in residential buildings. We demonstrate its usage on both synthetic data but also on real-world data from the Living Lab Energy Campus (LLEC) at KIT. We train and compare both on-policy and off-policy DRL agents on an expanded state space that incorporates real-time measurements (demand, PV generation, battery power, state of charge), external signals (dynamic electricity price, local weather data), calendrical and holiday indicators, and forecasts for demand and price. Our experimental results indicate that on-policy algorithms, particularly Advantage Actor Critic (A2C) and Proximal Policy Optimization (PPO), outperform off-policy methods in terms of cumulative rewards and policy stability. To explain these models, we employ post-hoc interpretation techniques to elaborate the learned control policies. Our findings demonstrate that the XRL framework not only reduces electricity costs through optimal battery management, but also provides transparent, actionable insights into the agent’s decision-making process.
[AI-41] RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network
链接: https://arxiv.org/abs/2606.02035
作者: Yogesh Kumar Meena,Saurabh Agarwal,K.V. Arya
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR data set confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports
[AI-42] Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
链接: https://arxiv.org/abs/2606.02011
作者: Ekaterina Alimaskina,Darya Rudas,Denis Shveykin,Gleb Molodtsov,Pavel Vasiliev,Aleksandr Beznosikov
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: this https URL.
[AI-43] Why Do Time Series Models Need Long Context Windows?
链接: https://arxiv.org/abs/2606.01999
作者: Luca Butera,Giovanni De Felice,Andrea Cini,Cesare Alippi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length P , an input window size strictly larger than P is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.
[AI-44] An NLP-Driven Framework for Curriculum-Labor Market Alignment: Schema-Constrained LLM Extraction ESCO-Anchored Semantic Matching and Multi-Dimensional Gap Quantification
链接: https://arxiv.org/abs/2606.01982
作者: Sherzod Turaev,Mary John,Mamoun Awad,Nazar Zaki,Khaled Shuaib
类目: Artificial Intelligence (cs.AI)
备注: 53 pages, 9 figures, 4 tables
Abstract:Schema-constrained information extraction from diverse educational and labor-market corpora remains an open challenge in natural language processing because existing pipelines rely primarily on lexical-surface methods that cannot recover implicit competencies, lack grounding in shared taxonomies, and provide no formal measures of extraction reliability or document-level completeness. To address these limitations, this paper proposes a four-stage NLP framework that combines (i) schema-constrained prompting of a two-model frontier-LLM ensemble against a JSON Schema-enforced seven-slot competency formalism, (ii) Sentence-BERT (SBERT) alignment of the extracted records against an eleven-domain ESCO v1.2.1 controlled vocabulary, (iii) a two-tier adjudication protocol that resolves inter-model disagreements, and (iv) a verification mechanism that combines per-slot Cohen’s kappa, schema conformance, and document-level completeness audits. The framework is instantiated for a critical application in higher-education quality assurance, namely curriculum-labor market alignment for the ABET-accredited BSc Computer Science program at the United Arab Emirates University. The pipeline extracts 400 competency records from the 85-course 2025-2026 study plan and aligns them, under a five-scope analysis ranging from the computing core to a probability-weighted student trajectory, with 30 job postings (483 requirement clauses) at an SBERT cosine threshold of 0.50. The extractor achieves Cohen’s kappa of 0.79 on the skill slot, with 100% schema conformance and 100% document-level completeness. The alignment surfaces interpretable supply-demand gaps of 25.0% in general and transversal skills, 13.8% in algorithms and computational theory, and 12.2% in software engineering and project management, with a near-zero 1.8% gap in artificial intelligence and data science despite 38.6% supply coverage.
[AI-45] Algorithmic algorithm development with LLM s: A Case Study on LLM -Usage for Contraction Order Optimization in Tensor Networks
链接: https://arxiv.org/abs/2606.01975
作者: Fabian Hoppe,Melven Röhrig-Zöllner,Philipp Knechtges
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Submitted to the proceedings of the deRSE26 conference
Abstract:We consider LLM-based algorithm development through a case study on contractionorder optimisation for tensor networks with OpenEvolve. We pay particular attention to the choice of the LLM as well as design choices such as evaluation metric and test instances. Our results highlight both the promise of verifier-guided evolutionary coding agents for algorithm development/improvement and the continuing importance of evaluation, validation, and interpretation – and corresponding challenges – by the human scientist.
[AI-46] AutoMedBench: Towards Medical AutoResearch with Agent ic AI Models
链接: https://arxiv.org/abs/2606.01961
作者: Junqi Liu,Salena Song,Yuhan Wang,Jiawei Mao,Hardy Chen,Xiaoke Huang,Tianhao Qi,Pengfei Guo,Yucheng Tang,Yufan He,Can Zhao,Andriy Myronenko,Dong Yang,Daguang Xu,Yuyin Zhou
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous agents are increasingly expected to support end-to-end medical-AI research workflows, moving beyond isolated prediction tasks or short-form clinical question answering. However, existing medical agent benchmarks primarily evaluate final outputs, providing limited visibility into agent behavior within the research process. To address this gap, we present AutoMedBench, a workflow-aware benchmark for autonomous medical-AI research across diverse medical imaging and multimodal inference tasks, organizing agent execution into a unified five-stage workflow (S1-S5): Plan, Setup, Validate, Inference, and Submit. It comprises long-horizon tasks with each run averaging 33 agent turns, spanning five research tracks: segmentation, image enhancement, visual question answering (VQA), report generation, and lesion detection. Each task is evaluated under two difficulty tiers, Lite and Standard, which use the same data and metrics but differ in the amount of task-brief scaffolding, and each run is scored using both final task performance and S1-S5 stage scores, enabling stage-level analysis from the initial task brief to the final submitted artifact. Across thousands of recorded runs, stage-level scoring reveals that Validate is the weakest workflow stage on average, whereas Setup is the strongest, suggesting that current agents are better at making pipelines executable than at verifying their reliability. Post-run error analysis further shows that verification and submission failures dominate tagged errors, accounting for 37.7% and 38.1% of fired codes respectively, whereas task-understanding errors are rare at 0.9%, and runs with one fired error code have a 48% lower overall score than runs with no error code on average.
[AI-47] VET: A Framework for Analyzing AI Discourse
链接: https://arxiv.org/abs/2606.01929
作者: Meredith Ringel Morris
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Public discourse on AI has become polarized; exaggerated positions on AI in traditional and social media threaten the development of AI Literacy among the general public. In this article, I introduce the VET Framework, a method for categorizing AI discourse along the dimensions of valence, effectiveness, and trajectory. I show how this framework can be used to identify, compare, and critique prevalent narratives of AI Hype, AI Doom, AI Denial, and AI Normalcy. Using VET, I analyze how each of these four stances exaggerates some aspects of the current state and/or likely evolution of AI, and illustrate how the VET framework can serve as an AI Literacy tool by supporting the ``vetting’’ of polarized AI discourse.
[AI-48] SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
链接: https://arxiv.org/abs/2606.01912
作者: Kuan Li,Shuo Zhang,Huacan Wang,Fangzhou Yu,Zecheng Sheng,Yi Gu,Weipeng Ming,Lei Xue,Chen Liu,Sen Hu,Ronghao Chen,Siyue Lin,Yuqing Hou,Xiaofeng Mou,Yi Xu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.
[AI-49] Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
链接: https://arxiv.org/abs/2606.01909
作者: Louis Mouchon
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 18 pages, 17 tables, 1 figure. Proof-of-concept, independent research
Abstract:We present Echo, a proof-of-concept audio system built around a single 25 M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised by stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Light heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the joint coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead-ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.
[AI-50] Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement
链接: https://arxiv.org/abs/2606.01906
作者: Keito Inoshita,Takato Ueno
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the K x K transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations in 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.
[AI-51] Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation ACL2026
链接: https://arxiv.org/abs/2606.01897
作者: Tianjiao Li,Kai Zhao,Xiang Li,Yang Liu,Huyang Sun
类目: Artificial Intelligence (cs.AI)
备注: Published as a main conference paper at ACL 2026
Abstract:Traditional Video Quality Assessment (VQA) focuses narrowly on aesthetic fidelity, overlooking the complex social dynamics that define quality in User-Generated Content (UGC). In this work, we propose a paradigm shift from signal-centric metrics to human-centric resonance assessment. We introduce CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance), a new task that evaluates whether a UGC item achieves positive community resonance based on its multimodal attributes rather than visual quality alone. To address this, we present MEDEA (Multimodal Engagement-Driven Evaluation Architecture), which introduces a novel Social Chain-of-Thought (Social-CoT) mechanism. Unlike traditional logical CoT, Social-CoT performs multimodal perspective-taking, instantiating diverse viewer personas to simulate collective cognitive and emotional reactions (i.e., the “community mind”) before deriving a quality judgment. MEDEA is trained via a two-stage approach involving supervised fine-tuning and process-supervised reinforcement learning with Social Alignment Reward to ensure reasoning paths are grounded in authentic human social cognition. To support this task, we release CASTER-Bench, a comprehensive human-annotated benchmark covering diverse UGC categories. Experiments demonstrate that MEDEA significantly outperforms state-of-the-art baselines on CASTER-Bench while providing interpretable and empathetic reasoning paths that align with real community feedback.
[AI-52] Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations
链接: https://arxiv.org/abs/2606.01894
作者: Deyu Zhuang,Peiliang Gong,Yang Shao,Liyuan Shu,Qi Zhu,Xiaoli Li,Daoqiang Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversible nature of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a Mask-Aware Continuous Mamba Encoder that explicitly leverages observation masks to extract context-rich control signals. Furthermore, we introduce a Physics-Guided Latent SDE with parametrically rectified hybrid drift, superimposing a global physical bias to enforce monotonic degradation even amid severe observation gaps. Additionally, we formulate RUL prediction as a boundary value problem via a Terminal Degradation Penalty, which decouples a Health Index dimension and applies a penalty loss to guide trajectories toward the failure state. Theoretically, we prove that our variational objective is mathematically equivalent to minimizing the KL divergence via Girsanov’s theorem, and we guarantee the global asymptotic stability of the learned dynamics through Lyapunov analysis. To enable rigorous evaluation, we develop a Hybrid Irregularity Generation Scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks demonstrate that PC-MambaSDE significantly outperforms state-of-the-art methods, particularly under extreme observation scarcity, validating the efficacy of embedding physical priors into continuous-time latent dynamics.
[AI-53] Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents
链接: https://arxiv.org/abs/2606.01886
作者: Ailiya Borjigin,Igor Stadnyk,Ben Bilski,Maksym Chikita,Dmytro Kyrylenko,Sofiia Pidturkina,Julia Stadnyk
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 17 pages, 3 figures
Abstract:Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk preferences, portfolio context, past judgments, and shifting market assumptions, while the agent answers, retrieves, acts, and forgets. In finance, this is not just inconvenient. In tasks such as market analysis, copy-trading review, and trade preparation, forgotten context and stale memory can create latency, repeated errors, weak auditability, and unsafe decisions. We propose the interaction-native knowledge harness (InKH), an architecture for financial LLM agents that absorbs complexity into the system. InKH converts user, market, portfolio, and tool events into structured operational knowledge. It uses passive knowledge injection to assemble a bounded working context buffer before the main model step, temporal graph memory for low-latency retrieval, a wiki audit surface for human-readable governance, and background extraction with maturity, decay, and write-time invalidation. We evaluate InKH on a reproducible controlled synthetic benchmark with 24 random seeds, 4 rounds, 80 episodes per round, and 6 baselines, producing 46,080 baseline-conditioned evaluations. InKH achieves mean task quality of 0.815 at 900 ms latency. Compared with agent-driven wiki-walk memory, it reduces latency by 82.95 percent, token cost by 82.29 percent, and stale-knowledge usage by 96.58 percent, while improving quality by 0.108 and traceability by 0.461. Compared with a temporal-graph system without invalidation, it improves quality by 0.050 and reduces stale-memory usage by 96.58 percent with comparable serving cost. The results support a design thesis for financial AI: adoption happens when complexity is absorbed by the system rather than transferred to the user. The benchmark validates architecture-level behavior, not live trading performance. Comments: 17 pages, 3 figures Subjects: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE) Cite as: arXiv:2606.01886 [cs.AI] (or arXiv:2606.01886v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.01886 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ailiya Borjigin [view email] [v1] Mon, 1 Jun 2026 08:31:35 UTC (3,105 KB)
[AI-54] EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
链接: https://arxiv.org/abs/2606.01884
作者: Ziyuan Li,Yueyu Sun,Yimeng Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal calibration. However, inter-subject variability and signal non-stationarity often entangle motor semantics with subject-specific noise, limiting subject-independent decoding. Recent multimodal approaches use text as a semantic anchor, yet text provides sparse and static supervision for inherently dynamic motor processes. To address this issue, we propose EVA-Net, a two-stage framework that uses action videos as semantic priors for subject-independent EEG motor decoding. In the first stage, EEG and video features are aligned in a shared space using cross-modal and supervised contrastive objectives to reduce subject-specific variation. In the second stage, video category prototypes and knowledge distillation transfer video-derived priors to an EEG-only classifier without adding inference overhead. Experiments on two public datasets show that EVA-Net achieves strong subject-independent decoding performance, including an 8.66% LOSO accuracy gain on EEGMMI. Ablation results further suggest that video provides a more effective semantic anchor than the text baseline considered in this work.
[AI-55] WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis
链接: https://arxiv.org/abs/2606.01869
作者: Shuo Lu,Yinuo Xu,Kecheng Yu,Siru Jiang,Yongcan Yu,Yubin Wang,Haitao Yang,Yuxiang Zhang,Bin Wang,Ran He,Jian Liang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are increasingly asked not only to write static interfaces, but to construct executable interactive worlds from natural language. Browser-native 3D, commonly built with this http URL, is a natural next frontier: generated programs must integrate assets, obey spatial and physical constraints, and keep user-facing controls synchronized with hidden runtime state. Existing web-generation benchmarks and evaluators, however, largely observe only pixels or DOM nodes, while the mechanics of a this http URL world unfold inside an opaque canvas. We introduce WorldCoder-Bench, a benchmark for autonomous, physically grounded 3D world synthesis. WorldCoder-Bench contains 2,026 expert-curated tasks across Simulation, Rendering, and Application scenarios, with optional .glb assets and hidden behavioral contracts. We further propose StateProbe, an execution-based protocol that probes generated programs in a sandboxed browser and verifies hidden, mutation-hardened contracts over runtime states and transitions. Beyond verification coverage, we report Return on Automation and Time Efficiency Multiplier to measure correctness-adjusted cost and time savings. Across nine frontier models, the best system reaches only 27.8% verification coverage on WorldCoder-Core and 19.9% on WorldCoder-Robust, with failures dominated by state-schema drift and broken interaction chains rather than missing scene elements. Utility metrics further show that cheap or fast models can still provide substantial value on easier domains. WorldCoder-Bench is available at this https URL.
[AI-56] Boosting Multimodal Federated Learning via Chained Modality Optimization
链接: https://arxiv.org/abs/2606.01856
作者: Zixin Zhang,Fan Qi,Shuai Li,Xiaoshan Yang,Changsheng Xu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Federated Learning (MMFL) enables privacy-preserving collaborative learning across decentralized clients with heterogeneous data and modality availability. However, most existing MMFL methods cast multimodal training as a joint optimization problem, overlooking a key bottleneck: modality competition, where dominant modalities suppress weaker ones and lead to suboptimal global models. To address this, we propose FedMChain, a balanced MMFL framework that structures federated multimodal training as a chain of modality-wise phases. This phase-wise design gives each modality a dedicated local optimization window on multimodal clients to mitigate modality competition, and further promotes cross-modal complementarity via an error-compensated regularizer. On the server side, we employ a sparse sign-guided aggregation strategy that leverages directional sign agreement for robust intra-modality aggregation, avoids destructive averaging, and supports less frequent synchronization to reduce communication overhead. Extensive experiments on multimodal benchmarks demonstrate that FedMChain consistently improves predictive performance while requiring less frequent communication than baselines.
[AI-57] Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLM s via Conformal Prediction
链接: https://arxiv.org/abs/2606.01850
作者: Yujia Tong,Yuxi Wang,Yunyang Wan,Tian Zhang,Junhao Dong,Jingling Yuan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Model compression techniques such as quantization and pruning are widely used to reduce the deployment cost of large language models (LLMs), with existing evaluations focusing almost exclusively on accuracy preservation. However, in safety-critical applications, a model’s ability to reliably quantify its own uncertainty is equally important. We ask: does compression preserve this ability? To answer this question, we benchmark 12 LLMs under various compression configurations across five NLP tasks, using conformal prediction to provide a rigorous, distribution-free measure of uncertainty. Our experiments reveal that: (I) compression frequently decouples accuracy from uncertainty; (II) larger models absorb compression-induced uncertainty far more effectively than smaller ones; and (III) uncertainty inflation is often threshold-like rather than gradual. These results suggest that accuracy-only evaluation is insufficient for assessing the deployment readiness of compressed LLMs, and that uncertainty-aware benchmarking should be a standard component of model compression pipelines.
[AI-58] Evaluation of Baseline Methods for IDD-based SSD External Memory Search
链接: https://arxiv.org/abs/2606.01840
作者: Yuki Suzuki,Alex Fukunaga
类目: Artificial Intelligence (cs.AI)
备注: accepted to The 19th International Symposium on Combinatorial Search (SoCS2026)
Abstract:Many difficult search problems cannot be solved by algorithms such as A* using only RAM. Search algorithms which use external memory such as SSDs and HDDs with much higher capacity than RAM have been proposed in previous work, but previous work has focused on delayed duplicate detection approaches, as well as complex immediate duplicate detection (IDD) methods, and relatively simple methods for IDD have not been systematically studied. In addition, the effect of OS-level mechanisms for managing and speeding up accesses to external memory, such as page caches, has not been studied. This paper addresses these gaps in the literature by evaluating and analyzing the performance of simple baseline approaches for IDD-based A*.
[AI-59] Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation
链接: https://arxiv.org/abs/2606.01833
作者: Kaihui Cheng,Zhiqiang Cai,Wenkai Xiang,Zhihang Hu,Siyu Zhu,Tzuhsiung Yang,Yuan Qi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that the method (i) raises diversity by 35% on DynamicPDB-80; (ii) on 12 zero-shot Fast-Folding proteins, the learned bias alone reaches the unbiased emulator’s coverage up to \sim15\times faster, and pairing it with refinement reaches the coverage up to \sim37\times faster while covering \sim3\times as many low-energy states. Code will be released soon.
[AI-60] CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback
链接: https://arxiv.org/abs/2606.01830
作者: Bin Chen,Xinye Liao,Yiming Liu,Xin Liao,Chonghan Liu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories. We argue that improving learning on such problems requires additional guidance during training, and RLVR already contains verifier-side information that can provide it. This information can identify errors or omissions in the agent’s submitted answer and guide revision within the rollout. We propose a training-time mechanism called \textbfCredit-Attenuated Privileged Feedback (CAPF), which makes this verifier-side information available through a Privileged Feedback call during training. CAPF lets the policy revise zero-reward attempts into positive-reward repair trajectories and attenuates credit for the feedback call and earlier actions to accommodate deployment without this call. Empirical research demonstrates that CAPF improves Qwen3-4B’s average exact-match score from 44.7% under outcome-only RLVR to 48.5% on seven open-domain QA benchmarks.
[AI-61] oken Predictors Are Not Planners: Building Physically Grounded Causal Reason ers
链接: https://arxiv.org/abs/2606.01810
作者: Zheng Lu,Mingqi Gao,Qinlei Xie,Wanqi Zhong,Hanwen Cui,Heng Cao,Zirui Song,Yifan Yang,Chong Luo,Bei Liu,Yiming Li
类目: Artificial Intelligence (cs.AI)
备注: 77 pages, appendices included. Code: this https URL
Abstract:Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
[AI-62] OctoT2I: A Self-Evolving Agent ic Text-to-Image Router
链接: https://arxiv.org/abs/2606.01803
作者: Xu Jiang,Bin Chen,Gehui Li,Yule Duan,Ronggang Wang,Jian Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (eg, style, color, count) and then intelligently explores their combinations via an iterative" Propose–Solve–Evaluate–Learn"(PSEL) loop. The PSEL loop efficiently discovers each tool’s capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.
[AI-63] MOSS-Audio Technical Report
链接: https://arxiv.org/abs/2606.01802
作者: Chen Yang,Chufan Yu,Hanfu Chen,Jie Zhu,Jingqi Chen,Ke Chen,Wenxuan Wang,Yang Wang,Yaozhou Jiang,Yi Jiang,Zhengyuan Lin,Ziqi Chen,Zhaoye Fei,Chenghao Liu,Jun Zhan,Kang Yu,Kexin Huang,Mingshu Chen,Qinyuan Cheng,Ruixiao Li,Shimin Li,Songlin Wang,Yang Gao,Yiyang Zhang,Xipeng Qiu
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: \textbfDeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and \textbftime markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
[AI-64] Consistency evaluation of benchmarks used for causal discovery
链接: https://arxiv.org/abs/2606.01789
作者: Yuzhe Zhang,Chihui Chen,Lina Yao,Chen Wang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In graphical causal model, causal discovery aims to construct a causal graph based on numerical data and domain knowledge in plain text. However, the evaluation of causal discovery methods remains a challenge in the area as the progress of domain researches often makes benchmark causal graphs contain mis-aligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods as they are sensitive to the new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases, and prompts LLMs to check the consistency between the benchmark causal graphs and domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline in total proceeds 38,081 domain papers. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.
[AI-65] Stochastic convergence of parallel asynchronous adaptive first-order methods
链接: https://arxiv.org/abs/2606.01787
作者: Serge Gratton,Philippe L. Toint
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order O(1/sqrtt) under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.
[AI-66] Structure-Guided Adaptive Propagation for Protein-Protein Interaction Site Prediction
链接: https://arxiv.org/abs/2606.01781
作者: Enqiang Zhu,Yizi Liu,Yilong Luo,Yao Chen,Yu Zhang,Baoshan Ma
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures
Abstract:Accurate prediction of protein-protein interaction sites (PPIS) is essential for understanding cellular processes, disease mechanisms, and therapeutic target discovery. Graph-based deep learning has advanced PPIS prediction by incorporating residue-level structural context. However, most graph-based models still rely on fixed propagation schemes that treat all residues similarly, despite the structural and functional heterogeneity of protein interfaces. Such propagation may limit the ability to adapt information diffusion to local geometric environments, making it difficult to distinguish true interaction sites from structurally similar non-interacting neighbors. We present SGAP-PPIS, a structure-guided adaptive propagation model for PPIS prediction. Rather than using a fixed propagation mechanism, SGAP-PPIS leverages multi-scale geometric states from an equivariant graph neural network to generate residue-wise propagation coefficients. This design allows each residue to adaptively balance local feature preservation and neighborhood diffusion according to its geometric microenvironment. Experimental results show that SGAP-PPIS achieves competitive performance among the state-of-the-art methods on Test_60. Ablation studies show that geometry-conditioned adaptive propagation, scale-aligned geometric guidance, and multi-step propagation-state representation jointly drive these improvements.
[AI-67] FLARE: Diffusion for Hybrid Language Model
链接: https://arxiv.org/abs/2606.01774
作者: Yuchen Zhu,Jing Shi,Chongjian Ge,Hao Tan,Yiran Xu,Wanrong Zhu,Jason Kuen,Koustava Goswami,Rajiv Jain,Yongxin Chen,Molei Tao,Jiuxiang Gu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.
[AI-68] Adaptive Auto-Harness: Sustained Self-Improvement for Agent ic System Deployment on Open-Ended Task Streams
链接: https://arxiv.org/abs/2606.01770
作者: Zewen Liu,Zhan Shi,Yisi Sang,Bing He,Minhua Lin,Tianxin Wei,Dakuo Wang,Benoit Dumoulin,Wei Jin,Hanqing Lu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Auto-harness systems such as A-Evolve, GEPA, and Meta-Harness improve LLM agents by optimizing prompts, skills, tools, memories, and supporting infrastructure from execution feedback, but they are typically evaluated on fixed offline benchmarks. Real deployments instead present open-ended task streams: histories grow without a fixed endpoint, heterogeneous tasks require different harnesses, and problem distributions shift over time. These challenges make a single repeatedly and densely updated harness brittle, causing performance degradation as accuracy peaks early and then declines. This motivates sustained harness construction with task-wise adaptation. We introduce Adaptive Auto-Harness, a framework and system for such streams. The framework decomposes the gap to an oracle harness into evolution loss and adaptation loss. The system addresses these losses with a stateful multi-agent evolver, a harness tree with solve-time routing, and human-steering hooks for cases where history lacks the needed signal. Across prediction-market, security-competition, and event-forecasting streams, Adaptive Auto-Harness outperforms five existing auto-harness baselines and ablations attribute gains to better construction, routing, or targeted human steering. Code is available in this https URL .
[AI-69] EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks
链接: https://arxiv.org/abs/2606.01767
作者: Yangxuan Zhou,Sha Zhao,Jiquan Wang,Shijian Li,Gang Pan
类目: Artificial Intelligence (cs.AI)
备注: 18 pages,12 figures
Abstract:Electroencephalography (EEG) is the cornerstone of non-invasive brain-computer interfaces (BCIs), yet conventional decoding relies on fragmented, task-specific architectures that severely limit cross-task scalability. While EEG foundation models pre-trained on massive corpora promise universal brain decoding, current post-training depends on task-isolated fine-tuning. This static paradigm restricts knowledge transfer across heterogeneous tasks, hinders model scalability, and incurs computational and storage overheads that scale linearly with task count. To overcome these bottlenecks, we formulate downstream adaptation as a cross-task continual learning problem and propose EvoBrain, a dynamic, task-aware continual learning framework for unified EEG decoding. EvoBrain addresses the plasticity-stability trade-off via two complementary components: (1) Neuro-Spectral Task Normalization (NSN) aligns incoming tasks with historical statistics while recalibrating spectral responses to handle distributional and neuro-spectral shifts; and (2) Response-Affinity Distillation (RAD), combined with time-dependent replay, preserves old-task response geometry and promotes selective knowledge transfer between spectrally compatible tasks, effectively mitigating forgetting. Extensive evaluations across six distinct BCI tasks demonstrate that EvoBrain consistently surpasses state-of-the-art methods across diverse foundation backbones, optimally balancing plasticity and stability. To our knowledge, this work pioneers cross-task continual learning in the EEG domain, advancing the realization of a unified, one-for-all brain decoding system.
[AI-70] SECUREVENT: Hybrid AI/ML Security Monitoring for Distributed Event-Based Systems
链接: https://arxiv.org/abs/2606.01741
作者: Eric Liang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Distributed event-based systems have become a common substrate for Internet-scale publish/subscribe services, IoT telemetry, cloud-native microservices, and security operations pipelines. Their loose coupling and asynchronous delivery improve scalability, but they also expand the attack surface: publishers, brokers, subscribers, topics, schemas, and temporal ordering can each be abused without a single component observing the whole behavior. This paper proposes SECUREVENT, a hybrid AI/ML security-monitoring architecture for distributed event-based systems. The architecture combines traditional protections such as authenticated transport, topic-level authorization, and signed events with online anomaly detection, graph-aware behavioral features, complex-event policy rules, federated learning, and adversarial-ML governance. A deterministic prototype study over synthetic event-stream attacks illustrates how a hybrid AI/CEP monitor can improve recall over static rules while retaining a low false-positive rate. The central claim is not that machine learning replaces cryptographic and access-control mechanisms, but that model-based security monitoring is necessary when event flows, identities, schemas, and timing relationships are too dynamic for static controls alone.
[AI-71] rafficRAG : A Multimodal RAG Framework for Traffic Accident Liability Determination ICANN2026
链接: https://arxiv.org/abs/2606.01737
作者: Xu Li,Zedong Fu,Xinyi Li,Xun Han
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures, accepted at ICANN 2026
Abstract:Traffic accident liability analysis is a critical yet challenging task in intelligent transportation and legal assistance. Existing methods often suffer from low efficiency, subjective judgment, and inconsistent analysis results. Meanwhile, large language models are constrained by noisy video inputs and insufficient legal domain knowledge. To address these issues, this work presents TrafficRAG, a multimodal retrieval-augmented framework for automated traffic accident analysis and report generation. Specifically, the proposed framework first adopts a vision-language model to produce structured textual descriptions of accident scenarios, which serve as accurate retrieval queries. Based on these textual queries, a hybrid retrieval strategy integrating BM25 sparse retrieval and dense embedding retrieval is employed to fetch relevant traffic regulations and similar historical cases. Finally, the large language model incorporates retrieved legal knowledge and multimodal accident evidence for comprehensive reasoning, and generates standardized, legally grounded liability analysis reports. Extensive experiments show that TrafficRAG consistently outperforms baseline methods, achieving 77.32% Legal Norm Adaptation Accuracy, 81.71% Factual Faithfulness, and a Liability Ratio MAE of 5.48%. The results validate that integrating multimodal factual evidence with legal clauses via retrieval augmentation can effectively improve the reliability and accuracy of traffic accident liability determination.
[AI-72] Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization
链接: https://arxiv.org/abs/2606.01730
作者: Jiangyu Chen,Banyi
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.01730 [cs.AI] (or arXiv:2606.01730v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.01730 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-73] Characterization of Multi-Model Agent ic AI Systems on General Tasks via Trace-Driven Simulation
链接: https://arxiv.org/abs/2606.01725
作者: Donghwan Kim,Prakhar Singh,Younghoon Min,Jongryool Kim,Jongse Park,Kiwan Maeng
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 18 figures, 2 tables
Abstract:Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures-owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and activities of every major participating LLMs, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.
[AI-74] Shortcut to Nowhere: Demystifying Deep Spurious Regression
链接: https://arxiv.org/abs/2606.01723
作者: Guanrong Xu,Jessica Li,Hao Wang,Yuzhe Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.
[AI-75] Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure
链接: https://arxiv.org/abs/2606.01722
作者: Jun He,Deying Yu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 8 pages, 1 table
Abstract:For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.
[AI-76] Fair Finetuning Mitigates Distribution Inference Attacks
链接: https://arxiv.org/abs/2606.01719
作者: Rakshit Naidu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 16 pages (11 main, 5 appendix)
Abstract:Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions – a threat known as distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound \textAdv(\mathcalA,M_f) \le \Delta_\textEO \cdot W , where W quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold \tau!=!0.1 across all settings; on ACS Income, the gap falls from \sim!15% to under 4% . Our work provides the first formal bound connecting a model’s measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.
[AI-77] wo-Fidelity Best-Action Identification for Stochastic Minimax Tree
链接: https://arxiv.org/abs/2606.01708
作者: Peter Chen,Xi Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 36 pages
Abstract:We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with language model long rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations comparing to existing BAI-MCTS baseline.
[AI-78] HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark
链接: https://arxiv.org/abs/2606.01686
作者: Seonghyeon Go,Yumin Kim
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI’s utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary `AI-or-human’ paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate ``AI Music Tracking’': the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.
[AI-79] DOT-MoE: Differentiable Optimal Transport for MoEfication ICML2026
链接: https://arxiv.org/abs/2606.01666
作者: Udbhav Bamba,Arnav Chavan,Aryamaan Thakur,Steve Teig,Deepak Gupta
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model’s performance while reducing active parameters by 50%.
[AI-80] E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation
链接: https://arxiv.org/abs/2606.01634
作者: Lin Jiang,Dahai Yu,Ximiao Li,Guang Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 48 pages,26 figures
Abstract:Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasize overall distributional fidelity while failing to faithfully capture extreme events. To advance existing research, we propose E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. E4GEN provides systematic insights into when, what, and how to control extreme-event generation through three key components. First, E-Activator learns the dataset-adaptive extreme-control signal activation step during the denoising process without interfering with regular temporal components, including trend and seasonality. Second, E-Predictor determines what control signal to enforce through Self-Driven Semantic Prediction, where each sample derives its own control signal by inferring latent extreme-event information during generation. It also includes a novel Data-Conditioned Training, Noise-Initiated Sampling mechanism to address the issue of unavailable training labels. Third, E-Control specifies how to control extreme-event generation through a trainable Extreme Control Network, which transforms the semantic control signal into layer-wise signals and injects it into the denoising process. We evaluate E4GEN on six datasets with 17 metrics, and extensive experiments show that E4GEN outperforms state-of-the-art models across multiple dimensions, including overall fidelity, extreme-event fidelity, and downstream utility.
[AI-81] A Framework for Graph-Conditioned Hierarchical Shapley Attribution in Patent Valuation
链接: https://arxiv.org/abs/2606.01632
作者: Joy Bose
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
备注:
Abstract:Estimating the economic contribution of a single patent inside a product that embodies tens of thousands of patents is a long-standing unsolved problem in intellectual property economics. We propose PatentXAI, a framework that treats patent valuation as a problem of explainable AI: given a characteristic function v(S) encoding the revenue achievable by patent subset S, a patent’s Shapley value measures its fair share of product profit in a way that satisfies efficiency, symmetry, dummy, and additivity. To make computation tractable we restrict each patent’s coalition to its Markov Blanket inside a knowledge graph, grounded in the C-SVE conditional independence theorem (Li et al., 2020). Scaling experiments from n=12 to n=100 patents using Pareto-distributed coverage graphs report median Markov Blanket size of 32.9 percent of n at n=100, with 90th-percentile blanket size of 55.2 percent of n, and runtime of 10 milliseconds per patent. Difference against exact ground truth at n=12 is 0.088; difference against a high-sample Monte Carlo reference at n=100 is 0.062 plus or minus 0.003. A dense-component experiment shows that when 80 percent of patents share one component, the blanket correctly expands to cover that dense cluster, and the difference versus reference falls to 0.039 because the pooled computation becomes more accurate on homogeneous portfolios. Profit allocation proceeds hierarchically: exact Shapley distributes total profit among macro-components, then centrality-weighted Shapley distributes each component budget among covering patents. Estimating v(S) from real data is the primary open problem; we distinguish this from the computational contribution and outline a concrete roadmap for empirical validation using public ETSI, USPTO, and this http URL datasets.
[AI-82] ReSkill: Reconciling Skill Creation with Policy Optimization in Agent ic RL
链接: https://arxiv.org/abs/2606.01619
作者: Zelin He,Haotian Lin,Boran Han,Wei Zhu,Haoyang Fang,Bernie Wang,Xuan Zhu,Runze Li,Matthew Reimherr
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic’s Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy’s ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.
[AI-83] Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization
链接: https://arxiv.org/abs/2606.01610
作者: Haoben Huang,Shuxin Liu,Ou Wu,Di Gao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Single-edit updates in large language models can trigger ripple effects across local knowledge neighborhoods: desirable propagation to related facts and unintended perturbation of preserved ones. Existing methods address these two effects separately, without explicitly modeling their coupling. We challenge this separation through an analysis of ripple responses across typical baselines, identifying two coupled design pressures: editable-side coordination and preserved-side leakage. We propose Joint Neighborhood Optimization (JNO), a new knowledge-editing framework to formalize and jointly address both pressures at the target-planning stage. JNO instantiates this principle through Pressure-Aware Coordination (PAC), which jointly optimizes neighborhood target representations under coupled constraints, and a semantic pre-execution gate that rejects high-risk target plans before parameter execution. Experiments on RippleEdits show JNO improves propagation and preservation metrics by at least 7.0% while preserving cross-backbone editing stability.
[AI-84] FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment IJCNN2026
链接: https://arxiv.org/abs/2606.01607
作者: Nazmus Shakib Shadin,Aaron Cummings,Xinyue Zhang,Bobin Deng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by IJCNN 2026
Abstract:Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transferring sensitive data, it allows devices to share only model weights, keeping personal data locally and secure. However, in real world settings, the data held by devices is often not evenly distributed and devices mostly differ in computing power and memory capacity. These differences make FL harder to maintain consistent performance across the system. To address these issues, we propose FedMTFI, a novel architecture that combines multi-teacher knowledge distillation (MTKD) with feature importance to improve the FL process in heterogeneous environments. In FedMTFI, clients are clustered based on similar hardware and model types. Each cluster trains a specific model on not independently and identically distributed (non-IID) data. Within a cluster, every client updates that model using only its own local private data. The server then aggregates the locally trained models in each cluster using FedAvg to form multiple prototype models. Then these prototypes serve as teacher models to train a global generalized student model using MTKD. What makes FedMTFI more unique is the integration of Shapley values (SHAP) to emphasize important features during distillation, which enhances both accuracy and interpretability. Experimental results show that FedMTFI achieves higher accuracy than traditional FL algorithms and performs more effectively under non-IID data conditions.
[AI-85] Estimating Mutual Information between Time Series and Temporal Event Sequences Across Diverse Analysis Tasks
链接: https://arxiv.org/abs/2606.01602
作者: Haoji Hu,Huaqing Mao,Yijun Lin,Xiaowei Jia,Jinwei Zhou,Minoh Jeong,Yao-Yi Chiang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:Pairwise dependence measures such as correlation and causality are fundamental to temporal data mining, yet there is still no principled and robust way to quantify dependence between heterogeneous data types, especially between continuous time series and discrete temporal event sequences. Existing approaches rely on ad hoc transformations or mutual-information estimators that are highly sensitive to quantization, repeated values, and event redundancy, leading to biased or unstable results in practice. We propose a nonparametric mutual information estimator that directly measures the dependence between time series and event sequences without data transformation, learning, or ad hoc discretization. Our method models the continuous-discrete duality of real-world time series to handle quantization and repeated-value artifacts and introduces a latent event clustering strategy to mitigate bias from event co-occurrence and redundancy. Together, these yield a robust and unified framework that bridges discrete and continuous mutual information. We evaluate the proposed estimator on four representative tasks: discrete-continuous time-delayed mutual information for causality analysis, global and local temporal repetition discovery, discrete covariate selection for time series forecasting, and continuous feature selection for classification. Experiments on synthetic and real-world datasets show consistent improvements over existing methods in accuracy, robustness, and interpretability, positioning our approach as a general-purpose dependence operator for heterogeneous temporal data, similar to Pearson correlation for homogeneous time series. Code available at: this https URL
[AI-86] RON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL
链接: https://arxiv.org/abs/2606.01599
作者: Tianze Yang,Yucheng Shi,Ruitong Sun,Jingyuan Huang,Ninghao Liu,Jin Sun
类目: Artificial Intelligence (cs.AI)
备注: 27 pages, 8 figures
Abstract:Reinforcement learning (RL) for visual reasoning needs scalable, verifiable, and controllable training signals. Existing visual RL post-training trains on static curated datasets, with fixed image-question-answer samples bounded by their collection budget. In this work, we introduce TRON (Targeted, Rule-verifiable Online eNvironments), an online environment substrate: a training rollout is generated on demand by a controllable generator-verifier program that samples a fresh latent visual state, renders an image, asks a question, and exactly verifies the answer. A single run can therefore draw an unbounded stream of fresh instances at the difficulty level required by the current curriculum. The current TRON suite contains 520 environments organized into five ability buckets (spatial, mathematical, diagram, pattern/logic, and counting); the same substrate supports both a single full model trained on all buckets and per-bucket ability-specialist models, with no additional data collection. We also introduce a substrate analysis covering generation reliability, instance and level diversity, cross-environment near-duplicates, and base-model pass rate by difficulty level. RL post-training with METHOD consistently improves performance on ten external multimodal reasoning benchmarks across Qwen3-VL-4B, Qwen2.5-VL-7B, and MiMo-VL-7B-SFT.
[AI-87] S-SPPO: Semantic-Calibrated Self-Play Preference Optimization ICML2026
链接: https://arxiv.org/abs/2606.01561
作者: Xiwen Chen,Wenhui Zhu,Jingjing Wang,Peijie Qiu,Zhipeng Wang,Huayu Li,ZhengXiao He,Xuanzhao Dong,Prayag Tiwari,Mingkun Xu,Yujian Xiong,Feng Luo,Abolfazl Razi,Brendan Hogan Rappazzo,Anderson Schneider,Yuriy Nevmyvaka
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by ICML2026
Abstract:Aligning Large Language Models (LLMs) with human preferences is often formulated via Direct Preference Optimization (DPO). However, the standard Bradley-Terry instantiation of DPO is limited in modeling common departures from transitivity in human preferences. To address this, recent work has introduced Self-Play Preference Optimization (SPPO), which iteratively refines the policy by training on self-generated win-lose pairs. Our investigation, however, reveals a critical instability in SPPO: the optimization is prone to policy degeneration when the preference oracle assigns overly confident wins to semantically indistinguishable responses. To mitigate this, we propose S-SPPO, a dual-space semantic calibration framework comprising: i) Supervision Calibration via semantic gating, which anneals win rate targets toward the maximum-entropy baseline as semantic overlap increases; and ii) Representation Calibration via latent repulsion to enforce geometric diversity to prevent manifold collapse and maintain latent diversity between chosen and rejected samples. Theoretically, we show that the calibration preserves the constant-sum game structure, facilitating convergence to a Nash Equilibrium. Empirically, S-SPPO avoids the performance degradation seen in prior methods, achieving 52.19% win rate and 47.46% length-controlled win rate on AlpacaEval 2.0 with Llama-3-8B, without using additional human-annotated preferences during training. The code will be available at this https URL.
[AI-88] GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks
链接: https://arxiv.org/abs/2606.01560
作者: Canyixing Cui,Tao Wu,Xingping Xian,Xiao-Ke Xu,Mao Wang,Weina Niu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassortative edges in assortative graphs and assortative edges in disassortative graphs. This structural inversion creates structure-feature mismatches that disrupt neighborhood aggregation across different graph types. However, we find that existing defenses are limited, as they either treat neighborhoods as monolithic under fixed assortativity assumptions or rely on standard softmax classifiers that fail to account for perturbation-induced representation shifts. To further exploit this observation, we adopt a robustness perspective that jointly disentangles node representations and decision spaces, isolating perturbation effects while enforcing well-separated decision regions. Based on this principle, we propose Graph Joint Disentanglement Network (GJDNet), a unified framework for robust node classification across diverse graph assortativity regimes. GJDNet enhances robustness at both representation and decision levels: it employs feature-driven soft structural disentanglement with skewness-aware neighbor filtering to suppress perturbation-induced structure-feature mismatches, and introduces a Spherical Decision Boundary (SDB) to promote intra-class compactness and inter-class separation in the embedding space, thereby stabilizing decision boundaries under perturbations. Theoretical analysis provides insights into the effectiveness of the proposed disentangled representation and decision mechanisms, while extensive experiments demonstrate that GJDNet consistently achieves strong robustness across graphs with different connectivity regimes.
[AI-89] RoleCDE:Benchmarking and Mitigating Role-Alignment Trade-offs in Role-Playing Agents
链接: https://arxiv.org/abs/2606.01552
作者: Huayi Lai,Shichao Song,Simin Niu,Hanyu Wang,Jiawei Yang,Zhouxing Wang,Zhiqiang Yin,Xun Liang
类目: Artificial Intelligence (cs.AI)
备注: 23pages
Abstract:Role-playing agents(RPAs) are widely used to steer large language models(LLMs) toward role-consistent behavior, yet existing benchmarks mainly evaluate surface-level fidelity and offer limited insight into decision making under role-alignment value conflicts. To address this gap, we introduce RoleCDE, the first benchmark designed to evaluate RPAs under structured conflicts between role-specific values and alignment-oriented constraints. RoleCDE formulates role-aware decision making as cognitive dilemma scenarios, jointly evaluating role-scenario grounding, value conflict resolution, and decision tendencies. The benchmark is constructed at scale, covering approximately 8k diverse role profiles and scenarios and nearly 24k dilemma instances across three difficulty levels and eight role categories. Evaluation of several mainstream LLMs reveals a “Role Value Decoupling” phenomenon, where agents systematically default to alignment-and morality-consistent decisions rather than role-specific values when the two conflict, even under explicit role conditioning. This behavior is largely invariant to dilemma difficulty but varies substantially across role categories. We further show that RoleCDE-based fine-tuning effectively mitigates this decoupling by improving value trade-off reasoning, while preserving general role-playing fidelity and general reasoning performance. Code is available at: this https URL.
[AI-90] N-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions
链接: https://arxiv.org/abs/2606.01540
作者: Farzaneh Heidari,Guillaume Rabusseau
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.
[AI-91] Joint Agent Memory and Exploration Learning via Novelty Signals
链接: https://arxiv.org/abs/2606.01528
作者: Shizuo Tian,Xiaohong Weng,Rui Kong,Yuxuan Chen,Guohong Liu,Yuebing Song,Jiacheng Liu,Yuchen Li,Dawei Yin,Ting Cao,Yunxin Liu,Yuanchun Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In open-ended environments, exploration is fundamental for autonomous agents, yet current language model agents struggle with this. Effective exploration requires memory, but retaining raw interaction histories is computationally expensive over long trajectories. While latent memory offers a solution to compress interaction histories, its training lacks reliable supervisory signals. We introduce \textbfJoint \textbfAgent \textbfMemory and \textbfExploration \textbfLearning (\textbfJAMEL), a framework that trains agentic memory and exploration policy together through novelty-driven interaction. We observe that memory and exploration form a mutually dependent loop: sustained exploration requires memory to distinguish exhausted behaviors from unseen ones, while novelty-seeking interaction provides the supervision needed to make memory useful for future exploration. By utilizing deterministic and persistent novelty signals such as code coverage in the GUI domain, we provide natural, annotation-free supervision for the memory module. Empirical evaluations demonstrate that \ours successfully generalizes to unseen environments. Its exploration capability outperforms open-weight baselines and rivals the exploration depth of a closed-source model while reducing token consumption. Our code and model are open-sourced at this https URL.
[AI-92] ERRA: Task-Embedded Reasoning and Representation Architecture for Cross-Domain Applications
链接: https://arxiv.org/abs/2606.01520
作者: Shayan Shokri
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A single action-conditioned latent predictive architecture can in principle be trained on the structured state of a driving scene, a robot workspace, or a financial order book. The ingredients for doing so within any one domain already exist and are individually validated: masked-latent prediction, action-conditioned latent world models, discrete action tokenization, and joint-embedding prediction on voxelized state. What is not established, and what TERRA addresses, is the transfer question: when does a representation or predictor learned in one structured-state domain carry over to a structurally analogous but otherwise unrelated domain, and by how much. We give this question a formal treatment. We model each domain as a controlled Markov process on a graded latent grid, factor any instantiation into thin domain adapters and a shared domain-invariant core, and identify a cross-domain correspondence with an approximate Markov decision process homomorphism whose quality is measured by a lax bisimulation discrepancy and, for domains lacking a shared coordinate system, by a Gromov-Wasserstein distance between their action-conditioned transition operators. Under a Lipschitz predictor we derive a transfer bound that separates source-model error from structural mismatch, grows geometrically in the prediction horizon, and is certified from below by the Gromov-Wasserstein distance; we then connect latent error to decision regret through the Lipschitz value property of bisimulation metrics. The resulting Structured-State Transfer Hypothesis is stated as a falsifiable claim with a preregistered experimental program, centered on a transfer test from driving scenes to order books, including conditions under which it is refuted. We present no empirical results: this is a research proposal that converts a widely repeated intuition into testable theory.
[AI-93] ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts ICML2026
链接: https://arxiv.org/abs/2606.01509
作者: Heng Zhao,Zilei Shao,Guy Van den Broeck,Zhe Zeng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top- k routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact- k routing, which samples k -expert subsets in the forward pass, and the backward pass uses gradients through each expert’s exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic- k routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact- k achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic- k achieves comparable performance with fewer activated experts.
[AI-94] Agent Operating Systems (AOS): Integrating Agent ic Control Planes into and Beyond Traditional Operating Systems
链接: https://arxiv.org/abs/2606.01508
作者: Ankur Sharma,Deep Shah
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Traditional operating systems were designed around deterministic programs, explicit control flow, and human initiated workflows. Their core abstractions processes, threads, system calls, files, and permissions assume bounded behavior and predictable interaction patterns. Agentic AI systems introduce a different execution model: long-lived, goal-directed entities that reason probabilistically, invoke tools dynamically, and adapt behavior based on feedback. While agents can be implemented as user-space applications today, their execution characteristics stress OS boundaries in scheduling, memory and state management, security, observability, and governance. This paper introduces the concept of an Agent Operating System (AOS), a systems architecture that integrates an agentic control plane into existing operating systems or, in some models, subsumes selected OS responsibilities over time. We provide a precise definition of an AOS, explicit assumptions and non-goals, and a structured decomposition of AOS responsibilities into schedulers, context and memory management, tool and capability registries, policy and trust enforcement, and observability and audit. We analyze limitations of classical OS abstractions for agent workloads, propose integration models from user-space runtimes to distributed control planes, and map AOS concepts onto Linux and Windows primitives. We present security and safety implications, including agent specific threat models, and define evaluation criteria that emphasize deterministic enforcement, auditability, and operator comprehensibility. The objective is not to replace operating systems wholesale, but to establish a rigorous systems foundation for agentic computation that remains controllable, accountable, and secure at scale.
[AI-95] Move the Query Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics
链接: https://arxiv.org/abs/2606.01502
作者: Bole Ma,Jan Eitzinger,Harald Köstler,Gerhard Wellein
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注:
Abstract:Frontier LLMs increasingly decide what a query attends to with a sparse-attention indexer that picks a few KV-cache blocks per query: attention’s unit is now a small, reusable chunk. Agentic workloads hammer it: many sub-agents query one large codebase, reusing the same blocks. When that corpus outgrows one GPU it is partitioned across instances, so a query and the blocks it selects often sit on different GPUs: answering it means attention across instances. The reflex of prior cross-instance KV systems is to move the cache: pull the selected blocks to the requester. Multi-head Latent Attention inverts the arithmetic, compressing each token’s key and value into one narrow vector, so a routed query row is only ~1 KB, smaller than the chunk it attends; routing the query is then often cheaper than moving the cache. Which primitive wins, over which fabric and request shape, is uncharted, least of all on device-initiated RDMA that makes per-request cross-node transfers cheap. We characterize cross-instance MLA attention on a real multi-node H100 cluster, distilling two reusable artifacts: a topology-aware cost model (probe / transfer / compute / return / merge) and a closed-form route/fetch/local predicate, whose constants we measure on real IBGDA, where the model tracks batched round-trips to within ~7%. At decode it routes the query, trading the cost of moving the cache (a ~3 ms re-adaptation splice for a contiguous chunk, or a scattered gather under selection) for a tens-of-microsecond round trip, and picks the fabric by probe latency, not peak bandwidth. We instantiate the cost model and predicate for MLA, but neither is MLA-specific: they apply wherever compression or sparse selection shrinks attention to small chunks (DeepSeek-V3.2, V4, and GLM-5.1 today). Extending them to a new architecture requires measuring just two coefficients: the routed payload and fetch’s move-the-cache cost.
[AI-96] ClawHub Security Signals: When VirusTotal Static Analysis and SkillSpector Disagree
链接: https://arxiv.org/abs/2606.01494
作者: Vincent Koc,Patrick Erichsen,Jacob Tomlinson,Agustin Rivera,Michael Appel,Nir Paz
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 10 pages, 1 figure, 7 tables, 1 supplimentary dataset
Abstract:Agent skills extend AI agents with reusable instructions, tools, scripts, references, and workflows, establishing a security boundary distinct from both model safety and traditional package-malware detection. ClawHub Security Signals is a sanitized dataset of 67,453 latest public OpenClaw skill versions. Each row pairs redacted this http URL content and sanitized bundled files where present with a final ClawScan registry verdict and evidence from three scanner families: VirusTotal, static heuristic analysis, and NVIDIA SkillSpector. Rather than estimating malicious-skill prevalence, we study scanner disagreement. The three scanners rarely flag the same skills: any pair overlaps on at most 10.4% of their combined positives, only 0.69% of skills are flagged by all three, and 81.9% of flagged skills are identified by a single scanner. The disagreement is structured by attack surface. SkillSpector, which raises semantic agentic-risk advisories rather than malware-reputation signals, is positive for 19,209 of 25,504 suspicious rows (75.3%) but only 14 of 206 malicious rows (6.8%). The malicious-verdict region shows the inverse profile: 150 of 206 malicious rows (72.8%) are VirusTotal-positive, consistent with bundled-code malware evidence. These results show that agent-skill security requires layered governance, not single-scanner allow/block decisions. The corpus is released as a sanitized silver-standard dataset: labels are the registry’s automated verdicts, not human-annotated ground truth, and the release represents an early, versioned snapshot intended to support the community while a human-annotated subset is developed. Further research is encouraged, including models tailored for skill-security triage. Comments: 10 pages, 1 figure, 7 tables, 1 supplimentary dataset Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE) ACMclasses: D.4.6; D.2.5; I.2.11 Cite as: arXiv:2606.01494 [cs.CR] (or arXiv:2606.01494v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2606.01494 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-97] MURMUR: An Efficient Inference System for Long-Form ASR
链接: https://arxiv.org/abs/2606.01483
作者: Wei-Tzu Lee,Keisuke Kamahori,Baris Kasikci
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at this https URL.
[AI-98] Hierarchical Online Prompt Mutation with Dual-Loop Feedback for Guardrailed Evidence Document Generation: A Production-Evaluation Case Study
链接: https://arxiv.org/abs/2606.01472
作者: Nataraj Agaram Sundar Tejas Morabia
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages. Production-evaluation case study of guardrailed LLM evidence-document generation
Abstract:High-stakes production document-generation systems require language models to be adaptive, evidence-grounded, and auditable. We present HOPM, a hierarchical online prompt mutation framework evaluated on a real marketplace dispute-evidence workflow. HOPM treats prompts as online policies: a family/version router selects a prompt, deterministic guardrails attribute failures to mutable prompt-token categories, and dual feedback from human review and an automated judge updates both routing and mutation priorities. The primary evidence is an observed matched production-evaluation ablation: seven variants are evaluated on the same 600 cases each, enabling component comparisons against static prompting, manual iteration, bandit-only routing, mutation-only adaptation, human-only feedback, auto-judge-only feedback, and full dual-loop HOPM. Full HOPM improves count win rate over a static control from 34.7% to 45.7% (+11.0 pp; paired McNemar p = 1.31e-11) and amount-weighted win rate from 22.3% to 41.4% (+19.1 pp; 95% paired bootstrap CI [10.3, 28.9] pp). It also increases mean Likert quality from 3.18 to 4.40 and reduces issue-flag rate from 15.3% to 5.2%. Supporting review artifacts cover 770 generated-text reviews, 318 labeled reviewer exports, a 10-case/61-rating calibration slice, and a 70-case/350-rating OCR benchmark; these artifacts calibrate rubric, guardrail, title-risk, and OCR-risk interpretation rather than substituting for the production ablation. The paper includes control setup, sample sizes, confidence intervals, paired tests, prompt-token categories, pseudocode, schema, rubric, guardrail taxonomy, and a constructed example so the evaluation structure can be reproduced without exposing proprietary evidence.
[AI-99] ransferring Information Across Interventions in Causal Bayesian Optimization
链接: https://arxiv.org/abs/2606.01457
作者: Mohammad Ali Javidian
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or money. In its standard form, it treats the variables we control as plain inputs to a black box and cannot tell apart mere correlation from a real cause and effect. Causal Bayesian optimization closes part of this gap by using a known causal graph together with observational data to decide which variables are worth intervening on. Existing methods, however, learn the effect of each possible intervention almost in isolation, even though in a causal system these effects usually share the same underlying mechanisms. We propose graph-coupled causal Bayesian optimization, which ties the different intervention effects together through the uncertainty we have about a small set of shared causal parameters. The result is a causal kernel that lets evidence collected from one intervention improve our estimate of related interventions. For identifiable linear Gaussian causal models, we show that this kernel has low rank, bounded by the number of shared parameters rather than by the size of the intervention menu. This in turn yields an information-gain bound that grows only logarithmically in the optimization horizon, and a regret bound that cleanly separates three sources of error: optimization, causal estimation, and the choice of which intervention sets to consider. We also describe nonlinear and adaptive extensions. Across theory-aligned Gaussian systems, shared-mechanism stress tests, and standard causal optimization benchmarks, the method keeps the benefits of causal Bayesian optimization while transferring information across related interventions, with the clearest gains when direct interventions on the target’s parents are unavailable and sparse interventional data must be reused across a large family of candidate interventions.
[AI-100] On the Evaluation of Spiking Neural Network Configurations for Network Intrusion Detection
链接: https://arxiv.org/abs/2606.01442
作者: Raj Patel,David Amebley,Taye Akinrele,Shaswata Mitra,Sayanton Dibbo,Shahram Rahimi
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 1 figure, 3 Tables, This manuscript is under review for IEEE MILCOM 2026. \c{opyright} 2026 IEEE. Personal use is permitted; all other uses require IEEE permission, including reprinting, republication, redistribution, resale, or reuse of copyrighted components
Abstract:Network intrusion detection is a core component of modern cybersecurity infrastructure, yet the deep learning models that dominate the field are computationally demanding, motivating interest in lightweight alternatives suited to edge and neuromorphic deployment. Spiking Neural Networks (SNNs) are therefore a natural candidate, but their design space, spanning the choice of neuron model and spike encoding scheme, remains poorly characterized for intrusion detection. We bridge this gap by using a controlled ablation study using 9 neurons coupled with 3 spike encoding schemes, making 27 variants, all implemented on snntorch evaluated over raw inputs with limited preprocessing on four benchmark datasets (NSL KDD, KDDCup99, CIC-IDS2017, and CTU-13) with 5 seeds. We find that spike encoding scheme is a better determinant for detection quality than the neuron model, where rate and delta spike encodings perform worse than latency encoding over the sweep. The LeakyParallel neuron with latency encoding performed the best overall, averaging at 92.11% accuracy and 0.80 macro- F1 at a rate of 2.01% false positives averaged over all 4 datasets, with accuracy close to perfect for CIC-IDS2017 and CTU-13, and also performed the fastest on inference. These results highlight the potential of SNNs as a viable alternative to traditional methods of intrusion detection when considering low-latency or resource-constrained deployments.
[AI-101] Dive into Ambiguity: A*-Inspired Multi-Agent s Commonsense Obfuscation Attack on LLM Prompts
链接: https://arxiv.org/abs/2606.01441
作者: Boxuan Wang,Zhuoyun Li,Xiaowei Huang,Yi Dong
类目: Artificial Intelligence (cs.AI)
备注: Pre-print
Abstract:Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework, a framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient \gamma that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as \gamma decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.
[AI-102] CEAR: Certified Ensemble Adversarial Robustness in DNNs
链接: https://arxiv.org/abs/2606.01437
作者: Daniel Sadig,Mohammadreza Maleki,Hamed Karimi,Reza Samavi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This is the preprint of the work accepted for publication in the Proceedings of the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026); 19 Pages
Abstract:Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-critical applications. State-of-the-art empirical defense mechanisms improve the robustness of DNNs through the training phase, but still struggle against adaptive white-box attacks. On the other hand, certified defenses offer provable guarantees of robustness within a specified perturbation bound. These guarantees hold regardless of the level of perturbations, even if the attacker is given full knowledge of the model. In this paper, we propose CEAR, an ensemble-based robust method that utilizes a hybrid of empirical and certified defense mechanisms. CEAR trains each network within the ensemble using varying Gaussian noise and temperatures to obfuscate gradients and logits, making the model more resistant to stronger gradient-based attacks. We then use noisy logits and propose two different voting mechanisms to further improve robustness. Furthermore, we extend randomized smoothing to verify the robustness of ensemble-based classifiers. Our experimental evaluations on MNIST, CIFAR10, and TinyImageNet datasets demonstrate superior certified accuracy on average, increased robustness radius, and decreased transferability compared to baseline methods.
[AI-103] GovAI-Pipe: A Layered AI Governance Pipeline for Citizen-Facing AI in Turkeys e-Government Gateway
链接: https://arxiv.org/abs/2606.01417
作者: Ahmet Kaplan
类目: Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:Turkey’s e-Government Gateway (e-Devlet) serves over 68 million registered users with more than 9,200 government services, and is increasingly integrating artificial intelligence into citizen-facing applications such as chatbot assistants and eligibility assessments. However, no structured technical governance infrastructure currently connects high-level AI policy frameworks, such as the EU AI Act, OECD AI Principles, and Turkey’s own National AI Strategy, to the operational reality of deploying AI within a centralized e-government platform. We propose GovAI-Pipe, a four-layer governance pipeline designed using Design Science Research methodology that maps the AI model lifecycle to governance checkpoints: (1) pre-deployment validation for bias testing, explainability, and privacy impact assessment; (2) deployment governance for risk-tier classification and approval workflows; (3) runtime monitoring for drift detection, fairness tracking, and human-in-the-loop escalation; and (4) post-incident governance for audit trails, rollback, and citizen redress. Each layer is anchored to specific provisions of the EU AI Act, the GDPR data protection framework, and the National AI Strategy. We demonstrate the framework through two high-risk e-Devlet use cases, showing how GovAI-Pipe operationalizes governance principles as auditable, technical pipeline components.
[AI-104] Self-Healing Agent ic Orchestrators for Reliable Tool-Augmented Large Language Model Systems
链接: https://arxiv.org/abs/2606.01416
作者: Rahul Suresh Babu,Adarsh Agrawal
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-augmented large language model (LLM) agents rely on orchestration layers that coordinate planning, retrieval, tool invocation, validation, memory, and recovery. In these systems, failures arise not only from model errors, but also from orchestration-level issues such as tool timeouts, malformed arguments, stale context, contradictory evidence, retry loops, and unverified intermediate outputs. This paper presents a self-healing agentic orchestrator that treats reliability as a bounded runtime control problem. The orchestrator maps observable failure signals to inferred failure classes, selects targeted recovery actions under explicit budgets, verifies recovered trajectories, and records observability traces. We evaluate the approach on a 100-task controlled fault-injection benchmark against static workflow, retry-only, ReAct-style, and full-replanning baselines. Self-healing achieves 98.8% task success, compared with 94.5% for retry-only and 93.8% for full replanning. A matched recovery-budget sweep shows that self-healing outperforms retry-only and full replanning at every tested budget, with the largest gap under a single recovery attempt: 94.0% versus 85.3% and 88.2%, respectively. Under a controlled semantic silent-failure setting, verifier-guided self-healing reduces silent failures to 0.0%, while non-verifying baselines return wrong-but-plausible outputs more often. A compact model-in-the-loop validation shows that the same recovery mechanism can operate when a live tool-calling model performs tool selection, argument generation, and answer synthesis over local fault-injected tools. These results provide controlled evidence that failure-aware, budgeted, and verification-guided orchestration improves reliability and diagnosability in tool-augmented LLM systems.
[AI-105] Neural Network Compression by Approximate Differential Equivalence
链接: https://arxiv.org/abs/2606.01402
作者: Ravi Dhiman,Andrea Passarella,Mirco Tribastone,Lorenzo Valerio
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 19 pages, 4 figures
Abstract:Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We propose a complementary approach that compresses models by aggregating neurons with similar functional behavior rather than removing weights independently. Our method encodes a trained network as a polynomial ODE system and applies a lumping method called Approximate Forward Differential Equivalence to identify neurons with approximately matching induced dynamics. A single tolerance parameter, \varepsilon , controls the compression level and induces a smooth trade-off between model size and predictive accuracy. We evaluate the method on synthetic datasets derived from nonlinear dynamical systems with known ground-truth behavior and on public regression benchmarks. Across both settings, the proposed approach achieves substantial parameter reduction while preserving accuracy, and consistently compares favorably with magnitude-based pruning and Wanda at similar compression levels. These results suggest that differential equivalence-based aggregation is a principled and effective alternative to conventional weight-centric pruning.
[AI-106] Bridging Requirements and Architecture: Multi-Agent Orchestration with External Knowledge and Hierarchical Memory
链接: https://arxiv.org/abs/2606.01385
作者: Ruiyin Li,Yiran Zhang,Xiyu Zhou,Yangxiao Cai,Peng Liang,Weisong Sun,Jifeng Xuan,Zhi Jin,Yang Liu
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 39 pages, 7 images, 5 tables, Manuscript submitted to a Journal (2026)
Abstract:Software architecture design is a critical yet inherently complex and knowledge-intensive phase that requires balancing competing quality attributes and adapting to evolving requirements. Traditionally, this process has been time-consuming, labor-intensive, and heavily reliant on architects, often resulting in limited exploration of alternative architectural decompositions and styles, especially under the pressures of agile development. While LLM-based agents have shown promising performance across various software engineering tasks, their application to architecture design remains relatively scarce and requires systematic exploration. To address these challenges, we proposed MAAD (Multi-Agent Architecture Design), a knowledge-driven framework that orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to autonomously and collaboratively transform requirements specifications into comprehensive, multi-view architectural blueprints with quality attribute assessments. MAAD incorporates RAG to inject recognized architectural standards and patterns into the workflow and leverages a hierarchical memory mechanism that captures design history for iterative refinement. We evaluated MAAD through comparative experiments against MetaGPT, using quantitative architecture-level metrics across 10 case studies and qualitative feedback from industry architects on 10 real-world specifications. Results show that MAAD generates more complete, modular, and traceable architectures than the baseline, and its dedicated Evaluator agent autonomously produces structured quality evaluation reports that significantly reduce manual validation efforts. Furthermore, we found that the quality of the generated architecture heavily depends on the underlying LLM’s reasoning capacity, with GPT-5.2 and Qwen3.5 outperforming other models across most evaluation settings.
[AI-107] Efficient Exploration for Iterative Nash Preference Optimization
链接: https://arxiv.org/abs/2606.01382
作者: Tianlong Nan,Xiaopeng Li,Christian Kroer,Tianyi Lin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 49 pages
Abstract:Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of scalable NLHF remain limited. Existing regret guarantees rely on oracle-based methods that estimate a general preference model and solve KL-regularized minimax problems, while iterative NLHF methods directly optimize policy-level preference losses and are easier to implement but lack regret guarantees. We study online iterative NLHF under general preference models and identify exploration as the key obstacle. First, we show that standard iterative NLHF can suffer an exponential dependence on the KL-regularization parameter, revealing that implicit exploration through policy updates is insufficient for controlling regret. Second, we propose an explicitly exploratory iterative NLHF algorithm that combines SFT-based regularization with adversarial policy exploration. The resulting method retains the direct policy optimization structure of iterative NLHF, avoids explicit preference model estimation, and achieves an O(\sqrtT) regret bound without an exponential dependence on the KL-regularization parameter. We show that the regret can be improved to O(\log(T)) with access to a minimax oracle, clarifying the computational-statistical tradeoff in learning general preference games. Finally, we instantiate our method for LLM fine-tuning and evaluate it on \textttLlama-3-8B-Instruct across multiple benchmarks, where explicit exploration yields consistent improvements over existing NLHF baselines.
[AI-108] Beyond Access: Guided LLM Scaffolding for Independent Learning in Undergraduate Statistics
链接: https://arxiv.org/abs/2606.01375
作者: Mohammad Amanlou,Yasaman Amou-Jafari,Mehrad Livian,Fatemeh Boloukazari,Fereshte Bagheri,Behnam Bahrak
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 10 pages, conference: Proceedings of the 34th International Conference on Computers in Education. Asia-Pacific Society for Computers in Education
Abstract:Large language models (LLMs) are increasingly entering students’ learning practices, but their educational value depends on whether they support reasoning or enable task completion without engagement. This study examines guided LLM use in an undergraduate Probability and Statistics course, focusing on the gap between assigned access and actual interaction quality. In a four-week quasi-experimental summer program, students were organized into three balanced conditions: no LLM access, unrestricted LLM access, and guided LLM access. The guided condition used the same LLM platform as the unrestricted condition, but students received explicit training and rules promoting reasoning-focused help-seeking, stepwise hints, verification, and ethical use. All quizzes and the delayed final exam were completed without LLM or external assistance, allowing us to distinguish AI-supported practice performance from independent learning. Results show that guided use was associated with clearer learning-oriented interaction patterns than unrestricted access, especially in prioritizing reasoning over final answers and requesting stepwise support. Guided-LLM students showed stronger no-help quiz performance during the intervention phase, whereas unrestricted access appeared more useful for assisted practice completion than for consistently improving independent performance. Available time measures did not support a simple duration-based explanation, and self-assessment calibration suggested better alignment between perceived and demonstrated understanding in the Guided-LLM condition. Overall, LLM access alone appears to be an incomplete educational intervention. For Artificial Intelligence in Education (AIED), the central design challenge is to scaffold how students use LLMs so that these systems function as partners in reasoning rather than answer-getting tools.
[AI-109] Early Diagnosis of Wasted Computation in Multi-Agent LLM Systems via Failure-Aware Observability
链接: https://arxiv.org/abs/2606.01365
作者: Xianyou Li,Weiran Yan,Yichao Wu,Penghao Liang,Mengwei Yuan,Jianan Liu,Jing Yang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-using multi-agent large language model (LLM) systems spend computation through model tokens, tool calls, retries, and code execution before producing an answer. When a run fails, final-answer evaluation reveals the endpoint but usually not the point at which the trajectory stopped making recoverable progress. This paper introduces a failure-aware observability framework for diagnosing wasted computation in multi-agent LLM traces. The framework maps recurring failure modes to online trace signals, including tool reliability, execution recovery, orchestration loops, evidence availability, information change, and budget pressure. We instantiate the framework in a three- agent question-answering system and evaluate it on 165 GAIA validation traces under identical execution caps. Operational failures remain common: 22/53 level-1 runs, 33/86 level-2 runs, and 12/26 level-3 runs fail to produce a usable final answer. The traces expose different mechanisms behind these outcomes, including insufficient evidence, repeated-action loops, max-step termination, tool-failure streaks, and execution calls that succeed without useful output. Mean token use rises from 8,152 tokens at level 1 to 16,389 tokens at level 3, while evidence availability and sentence-level support diverge. A cached 10-trace LLM-judge grounding audit shows that cheap online signals and deeper semantic metrics capture complementary layers of failure. The results position failure-aware observability as a diagnostic layer between raw execution logs and final-answer accuracy.
[AI-110] Needles at Scale: LLM -Assisted Target Selection for Windows Vulnerability Research
链接: https://arxiv.org/abs/2606.01364
作者: Michael J. Bommarito II
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 9 pages, 3 figures, 2 tables
Abstract:The attack surface of a modern operating system is a haystack: thousands of signed binaries and millions of functions, almost none relevant to any given vulnerability. A human analyst or an LLM agent must pick the function worth reading before analyzing it. At whole-OS scope, this target selection, not the analysis, is the binding constraint. We present Symbolicate-Enrich-Sample, a low-cost batch pipeline that turns a corpus of production Windows binaries into a queryable, priority-ranked research queue. We (i) recover function-level symbols for stripped vendor binaries by auto-fetching the public symbol files and joining them to a recovered call graph; (ii) attach cheap, deterministic structural features to each named function and, conditioned on those features, use a low-cost language model to assign a reachability tier, a risk level, a bug-class hypothesis, and a rationale; and (iii) draw diverse, prioritized batches via a priority-weighted importance sampler. The contribution is a selection substrate: the prioritization layer a downstream detector or LLM agent runs on top of. Across a whole Windows image of 7,231,419 functions, the labels are markedly selective, and stacking deterministic filters on them leaves a ~22K-function shortlist: the candidate needles, few enough for a human or agent to work through. We characterize the pipeline’s selectivity and its failure modes, describe the methodology, and report aggregate statistics; we withhold the derived dataset for legal and dual-use reasons.
[AI-111] FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors KDD’26
链接: https://arxiv.org/abs/2606.01352
作者: Hongxu Ma,Han Zhou,Chenghou Jin,Jie Zhang,Xiaoyu Yang,Chunjie Chen,Jihong Guan,Shuigeng Zhou
类目: Artificial Intelligence (cs.AI)
备注: Accepted by KDD’26
Abstract:Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (or the fourth) paradigm – Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages NFs to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP Library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime’s significant superiority over SOTA methods.
[AI-112] Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems
链接: https://arxiv.org/abs/2606.01351
作者: Junze Zhu,Weihao Chen,Xuanwang Zhang,Zhen Wu,Xinyu Dai
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The transition from single-turn models to Multi-Agent Systems (MAS) promises enhanced problem-solving capabilities, yet the centralized orchestration topology remains a critical point of fragility. To analyze this, we propose a Mean-Field Entropy Dynamics framework, modeling the orchestration process as a system governed by the competing forces of task resolution and cumulative context loading. To facilitate validation, we introduce Inverse Workflow Generation (IWG), a multi-agent pipeline that synthesizes process-verifiable, high-complexity benchmarks with dense intermediate checkpoints. We demonstrate that our entropy dynamics model fits empirical trajectories, providing physically interpretable parameters that quantify system stability and performance collapse. Crucially, our analysis uncovers a ``Reasoning Trap": while reasoning-heavy models excel in isolated tasks, they frequently fail as orchestrators due to context squeezing. Elucidating the physical mechanisms underlying the Orchestrator and quantifying systemic uncertainty offers insights for the MASs’ architectural design.
[AI-113] Digital Twin-Assisted Adaptive Multi-Agent DRL for Intelligent Spectrum and Resource Management in Open-RAN UAV-Enabled 6G Networks
链接: https://arxiv.org/abs/2606.01324
作者: Marwan Dhuheir,Thang X. Vu,Symeon Chatzinotas
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
备注: accepted and presented at IEEE ICC-2026 conference paper
Abstract:The evolution toward 6G wireless networks envisions a seamlessly intelligent, Open-RAN-enabled architecture where unmanned aerial vehicles (UAVs) play a pivotal role in extending coverage, enhancing resilience, and ensuring reliable connectivity for ground users deployment. However, efficiently managing spectrum and resources in such highly dynamic UAV-assisted environments remains a major challenge due to nonlinear system interactions, mobility-induced topology variations, and stringent latency and energy constraints. To address these challenges, we propose a digital twin (DT)-assisted adaptive deep reinforcement learning (DRL) framework that enables intelligent spectrum sharing and resource allocation across distributed ground users. The complex optimization problem is decomposed into UAV trajectory optimization using particle swarm optimization (PSO) and dynamic spectrum-power-association management via multi-agent DRL (MADRL). This hybrid DT-driven approach empowers intelligent, context-aware decision-making and adaptive coordination among UAVs. Extensive simulations demonstrate significant gains in spectral efficiency, data rates, and energy utilization, showcasing a transformative path toward self-evolving, autonomous 6G UAV and ground users (GUs) connectivity.
[AI-114] Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery
链接: https://arxiv.org/abs/2606.01316
作者: Zhe Zhao,Haibin Wen,Yingcheng Wu,Jiaming Ma,Yifan Wen,Jinglin Jian,Jiacheng Ge,Xiangru Tang,Bo An,Ming Yin,Sanfeng Wu,Mengdi Wang,Le Cong
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed–one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation–and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability–a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline–can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process–a step towards scaling AI-native discovery to the planet. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.01316 [cs.AI] (or arXiv:2606.01316v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.01316 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Zhe Zhao [view email] [v1] Sun, 31 May 2026 16:05:41 UTC (9,756 KB) Full-text links: Access Paper: View a PDF of the paper titled Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery, by Zhe Zhao and 12 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-06 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-115] SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
链接: https://arxiv.org/abs/2606.01314
作者: Yangbo Wei,Zhen Huang,Shaoqiang Lu,Junhong Qian,Qifan Wang,Chen Wu,Lei He
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent self-evolving agents have shown that skills can be discovered, refined, and accumulated through execution. However, existing skill-evolution frameworks typically assume a fixed tool layer and evaluate each skill independently, limiting their ability to repair tool-level failures or reason about interactions among skills. We propose SkillSmith, a synergy-aware skill-tool co-evolution framework. SkillSmith introduces a unified proposal space in which reflection produces atomic bundles that jointly modify skills and tools, allowing tools to be wrapped, edited, composed, split, or retired when skill evolution identifies a reusable capability gap. To guide this joint search, SkillSmith maintains an ecological utility model inspired by Lotka-Volterra dynamics, where an interaction matrix estimated from execution traces captures pairwise complementarity and conflict among skills and provides pressure signals for retrieval, mutation prioritization, and retirement. Furthermore, SkillSmith records anti-patterns, including failure signatures, causal attributions, and remedies, to accelerate diagnosis and veto proposals that repeat known mistakes. Experiments on three benchmarks, including WildClawBench, and five Qwen3.5 model scales show that SkillSmith consistently outperforms strong baselines, with gains that amplify as task complexity and multi-skill co-activation increase.
[AI-116] PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making ICML2026
链接: https://arxiv.org/abs/2606.01313
作者: Rufeng Chen,Yue Chang,Xiaqiang Tang,Hechang Chen,Sihong Xie
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 21 pages, 7 figures. ICML 2026
Abstract:Open-vocabulary navigation requires embodied agents to manage significant perception uncertainty stemming from semantic ambiguity and model errors. However, most existing works settle for local optimal deterministic approaches, depriving complex navigation decision-making over multiple composite possibilities that are critical for globally better solutions. In this paper, we propose Probabilistic Scene Graph Navigation (PSG-Nav), which constructs a 3D Probabilistic Scene Graph that uses full semantic categorical distributions to account for perception uncertainty. To efficiently use the local distributions to compose and reason about the optimal navigation landmarks, we propose Multiverse Decision to sample multiple most likely world settings from the joint distribution, and evaluate navigation landmarks based on the compatibility between landmarks and multiverses. To mitigate false positives due to epistemic uncertainty in open-vocabulary navigation, we introduce the Evidential Experience Calibrator, which enables online lifelong adaptation by cross-validating detections against memories of past successes and failures. Extensive experiments on widely-used benchmarks MP3D, HM3D, and HSSD demonstrate that PSG-Nav establishes new state-of-the-art results, achieving Success Rates of 66.1%, 44.8%, and 67.9%, respectively. Code is available at: this https URL Comments: 21 pages, 7 figures. ICML 2026 Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.01313 [cs.RO] (or arXiv:2606.01313v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2606.01313 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rufeng Chen [view email] [v1] Sun, 31 May 2026 16:00:19 UTC (6,898 KB) Full-text links: Access Paper: View a PDF of the paper titled PSG-Nav: Probabilistic Scene Graph Navigation via Multiverse Decision Making, by Rufeng Chen and 4 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.RO prev | next new | recent | 2026-06 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-117] ChronosAD: Leverag ing Time Series Foundation Models for Accurate Anomaly Detection
链接: https://arxiv.org/abs/2606.01300
作者: Uzair Khan,Luigi Capogrosso,Francesco Biondani,Michele Magno,Franco Fummi,Francesco Setti,Marco Cristani
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at the 24th IEEE International Conference on Industrial Informatics (INDIN) 2026
Abstract:Time series anomaly detection is a crucial task in various domains, including finance, healthcare, and industry. However, existing methods often struggle to generalize across different datasets, especially when anomalies are subtle or context-dependent. To solve this issue, we introduce ChronosAD, a novel architecture for anomaly detection that uses a time series foundation model as a feature extractor. Specifically, it employs a two-stage pipeline: first, it uses the foundation model to extract embeddings for each time series in a zero-shot manner. Then, a custom-developed Temporal Block, composed of Bidirectional Long Short-Term Memory (BiLSTM) and Multi-Head Attention, refines these embeddings to capture temporal dependencies and highlight salient patterns. Unlike previous approaches, our model requires minimal task-specific tuning and demonstrates robust generalization across a wide range of domains, including industrial, medical, cyber-physical, and automotive systems. Extensive experiments on 11 benchmarks show that ChronosAD outperforms existing methods by 4.72% in AUC and 6.60% in AP on average. The source code is available at this https URL.
[AI-118] What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression
链接: https://arxiv.org/abs/2606.01292
作者: Wendao Wu,Fangqing Zhang,Haihan Zhang,Cong Fang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Teacher-Student Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Distillation (KD) to the emergent phenomenon of Weak-to-Strong (W2S) generalization. While existing studies offer isolated insights, a unified theoretical framework explaining the efficacy of KT across these disparate regimes remains lacking. In this work, we establish a unified spectral analysis of SGD dynamics in high-dimensional linear regression, elucidating the efficiency of KT across seemingly disparate regimes. We characterize KT efficiency through two distinct mechanisms: \emphSpectral Horizon Expansion in KD, which enables the capture of statistically inaccessible high-frequency signals, and \emphSpectral Denoising in W2S, where the student acts as a filter for optimization noise. Our framework unifies these phenomena, revealing that the efficacy of transfer is governed by the interplay between implicit regularization and heterogeneous spectral learning speeds over the spectrum.
[AI-119] RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning
链接: https://arxiv.org/abs/2606.01281
作者: Yixiu Mao,Yun Qu,Qi Wang,Heming Zou,Xiangyang Ji
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.
[AI-120] ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment
链接: https://arxiv.org/abs/2606.01279
作者: Zhengyang Zhao,Shengjie Ye,Lu Ma,Hao Liang,Hengyi Feng,Wentao Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI agents are increasingly being tasked with automating AI research itself, particularly the critical post-training phase that transforms base LLMs into aligned assistants. However, recent evaluations reveal that even frontier agents struggle to perform this task. While the success of post-training fundamentally relies on acquiring high-quality data, relying on agents to autonomously curate targeted training datasets from the open web introduces severe challenges. Executing the long-horizon tasks of searching, filtering, and balancing data within noisy web environments frequently overwhelms an agent’s limited context, ultimately leading to degraded dataset quality and suboptimal downstream training performance. To bridge this gap, we introduce Andes (Agent Native Data Evolving Synthesis), a framework that reimagines data generation as a plug-and-play \emphagent skill. Rather than forcing agents to devise complex data-gathering strategies from scratch, \textscAndes provides an intelligent abstraction layer. By leveraging a self-evolving World Tree routing mechanism and actionable diagnostic reports, it allows trainer agents to dynamically steer data synthesis through an interactive, closed-loop interface. We demonstrate that under strict compute constraints, equipping foundationally weaker agents with Andes improves automated alignment, securing state-of-the-art performance on PostTrainBench and robust cross-task generalization. Our project is available at this https URL.
[AI-121] Emergent Ordinal Geometry in Transformers Trained on Local Comparisons
链接: https://arxiv.org/abs/2606.01269
作者: Nishit Singh
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 10 figures
Abstract:Transitive inference is the challenge of inferring that A C from knowing only adjacent relations (A B, B C). It is solved by humans and animals not through logical chaining but via an analogue mental number line, whose signature is the symbolic distance effect: distant comparisons are easier than nearby ones. We ask whether Transformers acquire the same primitive, training small models exclusively on adjacent comparisons from a hidden total order and evaluating generalization to unseen distant pairs. We find that out-of-distribution generalization emerges alongside a striking geometric reorganization: entity embeddings collapse onto a one-dimensional manifold whose principal axis recovers the hidden rank order with near-perfect fidelity, and this structure is sensitive to optimization in ways that produce grokking-like transient dynamics. Critically, even when accuracy is at ceiling, decision confidence and geometric separation both scale monotonically with rank distance, directly mirroring the symbolic distance effect observed across decades of behavioural experiments on humans, primates, and rodents. These results ground a 50-year-old behavioural regularity in the geometry of learned representations, offering a mechanistic account of transitive inference that bridges cognitive science and modern neural networks.
[AI-122] PALTO: Physics-Informed Active Learning for Tri-Gate FinFET Design Optimization for Vertical Power Delivery
链接: https://arxiv.org/abs/2606.01265
作者: Ayoub Sadeghi,Leonid Popryho,Inna Partin-Vaisband
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper demonstrates the effectiveness of machine learning-driven optimization for designing application-specific GaN tri-gate FinFETs in vertical power delivery systems. Conventional TCAD-based approaches are computationally intensive and insufficient for navigating the high-dimensional, nonlinear design space of advanced GaN devices. To address this, a physics-informed active learning framework is used to intelligently guide simulations, accelerating convergence while preserving accuracy. This ML-guided approach enables the discovery of optimal configurations by efficiently exploring key structural parameters – most notably the GaN-to-AlGaN thickness ratio – a long-standing focus of debate in device design. By systematically exploring key structural parameters, two optimized devices with aggressively scaled gate-to-drain lengths are identified. Single-fin, multi-channel simulations show that device~D2, with a thinner GaN channel relative to the AlGaN barrier, achieves higher drive current. However, in a 300-fin configuration, device~D1 outperforms device~D2 by delivering 3.3,A at 0.49~ohm on-resistance – approximately 2 \times better – despite slightly higher parasitics. Both devices operate in a normally-off mode. Based on an application-specific figure of merit, device~D1 achieves 5,pC \cdot ohm, demonstrating 2 \times greater switching efficiency than device~D2, while both designs outperform industrial benchmarks from different performance standpoints.
[AI-123] SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback
链接: https://arxiv.org/abs/2606.01246
作者: Leo Luo,Haining Xie,Siqi Shen,Zhipeng Ma,Rui Ling,Hang Xu,Hefeng Jiang,Dingwei Chen,Yang Li,Peng Chen,Jie Jiang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Text-to-SQL on complex schemas is unreliable on a single pass, so recent systems generate multiple SQL candidates and let voting filter out errors. Yet voting alone is not enough, because the multi-candidate recipe has three coupled weaknesses: 1) sampling more from a single generator produces increasingly redundant candidates, 2) existing pipelines apply one generic correction to every non-clean execution result, while runtime errors, timeouts, and empty results each indicate a different distance from correctness, and 3) existing selectors rely on a single angle such as result-majority voting or pairwise SQL comparison, missing what other angles would have caught. We present SIRIUS-SQL, which addresses all three weaknesses. A difficulty-smoothing RL recipe trains SIRIUS-32B to generate diverse executable SQL candidates, paired with a generalist LLM that fills in gaps left by the specialist. An execution-grounded lifecycle classifies each outcome and applies targeted repair before candidates re-enter the pool. A confidence-gated hybrid selector combines execution-result agreement with pairwise SQL-form judgment, escalating only near-tied cases to a deterministic structural check. SIRIUS-SQL reaches 75.88% on BIRD dev and 91.20% on SPIDER test. Two of three generalist pairings surpass Agentar-Scale-SQL, the strongest published multi-candidate system on BIRD dev.
[AI-124] Brain-Atlas-Guided Generative Counterfactual Attention for Explainable Cognitive Decline Diagnosis Using Multimodal Connectomes
链接: https://arxiv.org/abs/2606.01237
作者: Xiongri Shen,Jiaqi Wang,Zhenxi Song,Yi Zhong,Leilei Zhao,Xin He,Baiying Lei,Zhiguo Zhang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mild cognitive impairment (MCI) and subjective cognitive decline (SCD) are closely associated with the early Alzheimer’s disease continuum, where accurate and explainable diagnosis is important for early risk assessment and intervention. Existing connectome-based deep learning models can improve classification performance but often provide limited insight into disease-related functional and structural connectivity changes. This paper proposes an atlas-knowledge-guided Generative Counterfactual Attention-guided Network (GCAN) for explainable cognitive decline diagnosis using multimodal brain connectomes. GCAN formulates diagnosis as a source-to-target counterfactual generation problem, where target-label connectomes are generated from source-label inputs and their differences are used to construct counterfactual attention maps. To preserve connectome topology, an Atlas-aware Bidirectional Transformer (AABT) performs network-level token encoding and decoding under brain-atlas constraints. The framework is further extended from functional connectivity (FC) to joint functional and structural connectivity (SC) modeling, enabling counterfactual analysis of complementary functional reorganization and structural topology changes. Experiments on hospital-collected and ADNI datasets show that GCAN achieves competitive performance across HC vs. SCD, HC vs. MCI, and SCD vs. MCI classification tasks. Visualization, circular connectome analysis, CAM-based comparison, ablation studies, and confidence interval analysis further support the interpretability and reliability of the proposed framework. Modality-specific FC and SC pre-trained classifiers are used to provide target-state priors for counterfactual generation while being separated from the downstream diagnostic classifier to prevent data leakage.
[AI-125] HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation
链接: https://arxiv.org/abs/2606.01230
作者: Yi Gu,Huacan Wang,Shuo Zhang,Yuqing Hou,Lei Xue,Weipeng Ming,Chen Liu,Fangzhou Yu,Kuan Li,Ronghao Chen,Sen Hu,Xiaofeng Mou,Yi Xu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language model agents are moving beyond text-only interaction toward physical-world control, with smart homes as a representative domain. Real domestic interaction requires understanding ambiguous intents, operating in dynamic environments, and performing multi-turn reasoning. However, existing methods struggle to generate high-quality training data for smart home agents. We propose HomeFlow, a verifiable data flywheel for this domain. HomeFlow uses HomeEnv as a unified simulation environment and HomeMaker to procedurally generate diverse home settings. Subsequently, Blueprint compiles open-ended user intents into executable state-based success conditions, while MCTS-Flow synthesizes diverse, verifiable multi-turn trajectories through environment-guided tree search. We then optimize the agents via supervised fine-tuning and step-wise RLVE, which facilitates iterative improvement through authentic physical feedback. We further construct SmartHome-Bench to evaluate the agent across various smart home tasks. On this benchmark, HomeFlow-RL-4B and HomeFlow-RL-8B achieve task success rates of 84.60% and 87.03%. It is worth noting that HomeFlow-RL-8B even surpasses the leading GPT-5.5 by 1.23 percentage points.
[AI-126] Application of Algorithms in Energy-Efficient Design Platforms for Green Building
链接: https://arxiv.org/abs/2606.01229
作者: Na Yu,Fu Wenli,Guo Fei
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures.2026 International Conference on Big Data Applications in Education and Engineering (ICBDAEE 2026)
Abstract:During green building design, computer-aided energy assessment is widely used to improve efficiency and achieve overall optimization. This paper presents a platform that combines Building Information Modeling (BIM), sensor operational data, and advanced simulation workflows using robust algorithms. The platform uses a multi-layer service architecture with dynamic energy simulation and evolutionary multi-objective optimization, connected via a high-performance C++ core and adaptive agent models. A mid-rise office building was selected as the case study. Five representative areas were chosen to collect data on building envelope characteristics and occupancy patterns. After preprocessing, missing sensor data accounted for 3.2% of annual records, and all variables were standardized using 15-minute interpolation. After 40 optimization rounds, annual energy consumption per square meter dropped by 29.3% from 315 kWh/m2 to 223 kWh/m2. The lifecycle cost increase for occupants was limited to 3.7%, and discomfort hours were reduced to under 70 hours per year. Analysis of Pareto optimal solutions shows that the envelope U-value ranges from 1.05 to 1.57 W/m2K, and nighttime ventilation rate ranges from 2.1 to 3.6 h-1, both closely linked to energy performance. The results confirm that the integrated algorithm framework offers good scalability, strong performance, and technical feasibility for green building design. This platform provides a reliable decision-support tool for design engineers and sustainability practitioners, enabling accurate, data-driven delivery of energy-efficient buildings.
[AI-127] Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis
链接: https://arxiv.org/abs/2606.01224
作者: Liu Qiong,Li Zhengbo
类目: Artificial Intelligence (cs.AI)
备注: 12 pages,5 figues
Abstract:Early detection of at-risk students and timely academic intervention pose major challenges in advanced mathematics education, where complex conceptual hierarchies and nonlinear learning trajectories often hold back students’ academic performance. This study adopts multimodal data analytics to build a dynamic framework for learning behavior prediction and academic early warning. It constructs a hierarchical knowledge graph ontology, realizes adaptive edge weighting according to problem difficulty and student performance, and combines heterogeneous graph attention with temporal sequence modeling to capture students’ evolving knowledge states. Empirical tests on semester-long multimodal datasets prove that this method can accurately identify high-risk students and effectively track error propagation. Targeted interventions greatly improve students’ knowledge mastery and reduce academic risks. The results verify that integrating knowledge graph analytics with multimodal temporal modeling can deliver more efficient and personalized learning support for advanced mathematics education.
[AI-128] Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing
链接: https://arxiv.org/abs/2606.01221
作者: Shermin Shahbazi,Hossein Mohammadi,Mohsen Afsharchi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 52 pages, 20 figures, accepted at Expert Systems with Applications
Abstract:Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade prediction performance on rare but important cases. Although extensively studied in classification, imbalanced regression remains relatively underexplored. Existing methods mainly focus on either data-level balancing, which may introduce noise and overfitting, or algorithm-level balancing, which often struggles with highly complex target distributions. To address these limitations, we propose a unified hybrid framework that integrates both data- and algorithm-level balancing strategies into a regressor-agnostic pipeline. The proposed framework consists of five stages: (1) adaptive bin partitioning to dynamically segment the target space based on local linear coherence; (2) target-conditioned representation learning using a Conditional Variational Autoencoder; (3) multistage data-level balancing through feature-space clustering and oversampling of minority clusters; (4) algorithm-level balancing using a novel Latent-Density Weighted Loss (LDWL) to emphasize rare samples in latent and target spaces; and (5) attention-based gated fusion for final regression. Experimental results on benchmark datasets demonstrate that the proposed framework consistently improves predictive performance compared to standalone regressors and existing imbalanced regression approaches.
[AI-129] Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling
链接: https://arxiv.org/abs/2606.01220
作者: Guang Lin,Shikui Tu,Lei Xu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 figures
Abstract:Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during Sampling or require carefully curated datasets during training, yet still achieve modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains a core challenge. To address these challenges, We propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high- quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.
[AI-130] Can LLM Agents Sustain Long-Horizon Organizational Dynamics?
链接: https://arxiv.org/abs/2606.01199
作者: Xuancheng Zhu,Yang Yue,Shuaibing Wan,Zihan Dou,Xiaohan Zhang,Yongrui Liu,Guoshun Nan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language agents are increasingly used for social simulation, yet it remains unclear whether they can sustain coherent behavior in structured organizations, where goals must propagate through hierarchy, tasks depend on prior execution, and artifacts accumulate over long horizons. We formulate long-horizon organizational simulation as a memory-centered coordination problem and introduce TaskWeave, a hierarchical agentic framework that maintains planning states through a Formulate-Partition-Diagnose-Align cycle and grounds execution through dependency-aware trace memory. We evaluate TaskWeave in a year-long IT company simulation and compare it with other multi-agent frameworks on organizational coherence, execution grounding, and downstream enterprise NLP utility. Experiments show that TaskWeave supports coherent and long-horizon organizational dynamics while producing grounded artifacts and adapting to external environments. These findings suggest that structured simulation memory is a key mechanism for building reliable LLM-based organizational simulators.
[AI-131] he Case for Model Science: Verify Explore Steer Refine
链接: https://arxiv.org/abs/2606.01189
作者: Przemyslaw Biecek,Luca Longo,Jianlong Zhou,Thomas Fel,Andreas Holzinger,Wojciech Samek
类目: Artificial Intelligence (cs.AI)
备注: Follow up on arXiv:2508.20040
Abstract:We argue that the AI community is now ready to move beyond benchmarking and consolidate scattered efforts in model analysis into a systematic discipline, a direction we term Model Science. Complex AI models now serve billions of users, yet our understanding of how they work lags far behind our ability to deploy them. Decades of benchmark-driven research have delivered remarkable progress: extensive leaderboards, a wide range of performance metrics, tracking capability gains across diverse tasks; yet this success has also revealed the limits of benchmarks as they tell us whether models perform but not why they succeed or fail, they miss critical failure modes, such as hallucinations or shortcuts. Precedents from established sciences point the way forward: cognitive science shows that understanding complex systems requires complementary levels of analysis; neuroscience demonstrates that deep study of single cases reveals what population studies miss; medicine teaches that specialised training must develop alongside research practice; and agriculture models how shared infrastructure and principles enable cumulative progress. These lessons inform three foundations for Model Science. First, we propose to consolidate research around four functional perspectives: Verify, Explore, Steer, and Refine that address complementary questions about model behaviour. Second, we discuss the required infrastructure for cumulative knowledge: catalogues of datasets, models and findings. Third, we highlight the need for deep analysis of individual model instances, not just model families, because single cases can reveal what population studies miss.
[AI-132] "Skill issues: data-centric optimization of lakehouse agents
链接: https://arxiv.org/abs/2606.01185
作者: Nicole Rose Schneider,Davide Ghilardi,Giacomo Piccinini,Jacopo Tagliabue
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Coding agents are becoming users of data infrastructure, but their success depends not only on model quality: it also depends on the skills and environment files that teach agents how to use a system. We study how to optimize these artifacts for agents operating on a branching lakehouse, Bauplan. In our setting, headless APIs and Git-like data primitives expose data workflows through code, branches, commits, and merges. Our central observation is that a branching lakehouse turns data-agent evaluation from an output-matching problem into a state-verification problem: agent-generated pipeline code induces concrete, inspectable lakehouse changes. We present a data-centric optimization pipeline that generates task-verifier pairs, executes candidate skills in isolated sandboxes, and scores trajectories using both trace-level signals and programmatic checks over lakehouse state. In a preliminary evaluation on 25 tasks, optimized skills improve accuracy by 31.9%. These results suggest that write-path data workflows provide a useful substrate for optimizing agent skills beyond read-only tasks.
[AI-133] Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies
链接: https://arxiv.org/abs/2606.01179
作者: Biswajeet Sahoo,Debadutta Patra
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neural Networks (PINNs) successfully solve differential equations, current architectures remain inherently domain-specific. The extraction of domain-invariant entropy representations across fundamentally different physical laws remains unexplored. This paper introduces a unified Physics-Informed Deep Learning (PIDL) framework that simultaneously enforces differential equation residuals and information-theoretic bounds within a single neural architecture. We demonstrate this framework via two canonical studies: (i) a thermodynamic continuous stirred-tank reactor (CSTR) model solving governing ODEs, where a Softplus constraint strictly enforces the Second Law of Thermodynamics; and (ii) an information-theoretic financial market model solving the inverse Fokker-Planck PDE to infer latent drift and diffusion coefficients, guaranteeing diffusion positivity via a Softplus constraint while naturally inducing Shannon entropy. Three model variants are evaluated: two domain-specific baselines and one shared-encoder architecture. The PIDL framework guarantees absolute thermodynamic admissibility with zero Second-Law violations and exhibits exceptional data efficiency, retaining 90% predictive accuracy using merely 30% of available training data. Furthermore, a post-hoc Ruppeiner Riemannian geometric analysis of the learned entropy surface successfully identifies thermodynamic phase instabilities. This methodology provides a robust, domain-agnostic architecture for physics-constrained entropy modeling, advancing applications in sustainable process design and quantitative financial risk assessment.
[AI-134] AI From the Margins (AIM): Rethinking Participatory AI Design Through the Lived Experience of Minoritized Communities AAAI
链接: https://arxiv.org/abs/2606.01171
作者: Tijs Portegies,Laureanne Willems,Maaike Harbers,Giovanni Sileno,Roland van Dierendonck,Mayesha Tasnim,Lotte Willemsen,Sennay Ghebreab
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Under review at the AAAI/ACM Conference on AI, Ethics, and Society (AIES 2026)
Abstract:Artificial intelligence (AI) can reproduce and amplify the structural inequities faced by minoritized communities. Participatory AI has been proposed as a response, but participation typically starts after problem definitions and success criteria have been set, leaving limited room for minoritized communities to reshape what an AI system is for. We propose AI From the Margins (AIM): a methodological stance that articulates the conditions under which lived experiences of minoritized communities can be elicited, centered, and carried forward to inform participatory AI design. AIM is not a fixed protocol; it articulates a set of preconditions that can be enacted through different techniques in different settings. We applied AIM in a Dutch healthcare context in eight sessions with 13 women and non-binary people of color and five municipal policy workers, namely through (1) narrative elicitation using the Biographic Narrative Interpretive Method (BNIM); (2) co-constructed rule-making; (3) participants’ determination of whether, where, and how AI should be involved; and (4) translating lived experience into AI policy through dialogue with policymakers. In their reflections on the sessions, participants described the engagement as substantive and called for its continuation, demonstrating how preparatory orientation fundamentally grounded in lived experience shapes what participatory AI design is for.
[AI-135] Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts ICLR2026
链接: https://arxiv.org/abs/2606.01162
作者: Ya Shen,Gang Chen,Hui Ma,Mengjie Zhang
类目: Artificial Intelligence (cs.AI)
备注: This paper has been accepted by the Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract:Workflow scheduling in cloud computing demands the intelligent allocation of dynamically arriving, graph-structured workflows with varying deadlines onto ever-changing virtual machine resources. However, existing deep reinforcement learning (DRL) schedulers remain limited by rigid, single-path inference architectures that struggle to handle diverse scheduling scenarios. We introduce \textbfDEFT (\textbfDeadline-p\textbfErceptive Mixture-o\textbfF-Exper\textbfts), an innovative DRL policy architecture that leverages a specialized mixture of experts, each trained to manage different levels of deadline tightness. To our knowledge, DEFT is the first to introduce and validate a Mixture-of-Experts architecture for dynamic cloud workflow scheduling. By adaptively routing decisions through the most appropriate experts, DEFT is capable of meeting a broad spectrum of deadline requirements that no single expert can achieve. Central to DEFT is a \textbfgraph-adaptive gating mechanism that encodes workflow deadlines and DAGs, task states, and VM conditions, using cross-attention to guide expert activation in a fine-grained, deadline-sensitive manner. Experiments on dynamic cloud workflow benchmarks demonstrate that DEFT significantly reduces execution cost and deadline violations, outperforming multiple state-of-the-art DRL baselines.
[AI-136] Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification
链接: https://arxiv.org/abs/2606.01160
作者: Shihao Ji,Haotao Tan,Zihui Song,Mingyu Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model’s token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in \textitLeibniz, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.
[AI-137] When Data Is Scarce: Scaling Sparse Language Models with Repeated Training ICML2026
链接: https://arxiv.org/abs/2606.01155
作者: Boqian Wu,Qiao Xiao,Patrik Okanovic,Tomasz Sternal,Maurice van Keulen,Mykola Pechenizkiy,Elena Mocanu,Torsten Hoefler,Decebal Constantin Mocanu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML2026
Abstract:Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data-constrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held-out dense-equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data-limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: sparse training postpones diminishing returns from repeated data, making multi-epoch training more effective. 3. Resource trade-offs: With fixed data, loss-optimal sparsity is moderate ~ 50%, while compute-optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade-offs under data scarcity. Our code is available at: this https URL.
[AI-138] ASE-26: a curriculum for agent ic software engineering as a discipline
链接: https://arxiv.org/abs/2606.01152
作者: Mikael Gorsky
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 12 pages, 20 references. Companion paper to the ASE-26 curriculum deposited on Zenodo at doi: https://doi.org/10.5281/zenodo.20468021 . Part 1 of a planned series of two pre-prints on the curriculum and its conceptual core
Abstract:The work of a professional software engineer has begun to consist, increasingly, of directing agents rather than writing code, and the empirical evidence for the shift is now several years deep. Anthropic’s Economic Index puts automation at 79 per cent of Claude Code interactions [2]; Handa and colleagues at Anthropic find AI exposure for Computer Programmer tasks at approximately 75 per cent of the role’s distinct activities [3]; Brynjolfsson and colleagues at Stanford’s Digital Economy Lab report a 13 per cent relative decline in employment for workers aged 22 to 25 in occupations most exposed to AI [4]. The shift is also unfinished, and the academic literature on agentic software engineering converges on the finding that the missing capability is not better models but structured practitioner discipline. This paper presents ASE-26, a comprehensive undergraduate curriculum for agentic software engineering as a discipline, deposited as a citable reference on Zenodo under CC BY-ND 4.0 [12]. The paper sets out the discipline framing the curriculum rests on, the conceptual contributions it makes (most importantly, the evolutionary spiral as the operational form of the co-evolution of intent and build), the twenty-one-module structure that organises the discipline for teaching, the pedagogical commitments that follow from grading work co-produced with an agent, what graduates leave with, and how the discipline as taught is designed to outlast the specific capabilities of today’s models. The position the paper takes is that the practitioner skills the industry currently lacks are precisely the skills the discipline names, and that structured undergraduate curricula in agentic software engineering are the principal mechanism by which the gap closes.
[AI-139] Reasoning 4Sciences: Bridging Reasoning Language Models to All Scientific Branches
链接: https://arxiv.org/abs/2606.01145
作者: Teddy Ferdinan,Bartłomiej Koptyra,Mikołaj Langner,Tomasz Adamczyk,Łukasz Radliński,Maciej Markiewicz,Aleksander Szczęsny,Stanisław Woźniak,Tymoteusz Romanowicz,Dzmitry Pihulski,Mateusz Zbrocki,Mateusz Śmigielski,Michał Rajkowski,Mateusz Biedka,Konrad Kiełczyński,Konrad Wojtasik,Jacek Duszenko,Jan Eliasz,Piotr Matys,Michał Bernacki-Janson,Maria Bellaniar Ismiati,Latius Hermawan,Wiktoria Mieleszczenko-Kowszewicz,Anna Kubicka-Sowinska,Grzegorz Chodak,Karol Postawa,Paweł Zyblewski,Tomasz Szandała,Łukasz Sterczewski,Adrian Chajec,Pawel Niewiadomski,Piotr Gruber,Marcin Wdowikowski,Sławomir Czarnecki,Bartłomiej Kryszak,Dominik Drabik,Tomasz Kajdanowicz,Kamil Mamak,Paweł Preś,Katarzyna Paczkowska,Joachim Sobczuk,Tomasz Zięba,Jan Kocoń,Maciej Piasecki,Przemysław Kazienko
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Reasoning Language Models (RLMs) are rapidly emerging as powerful tools for scientific research, their impact is primarily concentrated in “hard science” fields. The slow – or lack of – adoption of RLMs in other branches of science is causing a widening gap in research productivity. In this survey, we provide the first comprehensive analysis of RLM adoption across 28 scientific disciplines following the classification used by the European Research Council (ERC), spanning the Social Sciences and Humanities, Physical Sciences and Engineering, and Life Sciences. We examine how RLMs are developed, evaluated, and applied across disciplines. Furthermore, we introduce a maturity-oriented assessment framework based on available domain-specific development and evaluation resources, revealing substantial disparities in RLM maturity that become even more pronounced when only publicly available resources are considered. Finally, we highlight current implementation paradigms that are gaining popularity across disciplines, current challenges, and future directions in enabling RLM adoption across science.
[AI-140] SkillRevise: Improving LLM -Authored Agent Skills via Trace-Conditioned Skill Revision
链接: https://arxiv.org/abs/2606.01139
作者: Yuxuan Liu,Zhaochen Su,Lingyun Xie,Yuhao Zhang,Qing Zong,Jiahe Guo,Zhongwei Xie,Yiyan Ji,Yauwai Yim,Hongyu Luo,Xiyu Ren,Ruan Chenyu,Haoran Li,Yangqiu Song
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 4 figures
Abstract:Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework designed to iteratively refine these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. By re-executing candidates and measuring empirical utility, it systematically retains the optimal skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent’s success rate on SkillsBench from 36.05% to 61.63%. Furthermore, the revised skills exhibit strong cross-model transferability, capturing generalized procedural knowledge over model-specific artifacts.
[AI-141] AMP: A Vendor-Neutral Wire Format for Agent Memory Operations MICRO
链接: https://arxiv.org/abs/2606.01138
作者: Thamilvendhan Munirathinam
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 17 pages, 1 figure, 6 tables. Reference implementation with 5 backend adapters (sqlite-vec, mem0, Letta, Cognee, pgvector), governance UI, microbench, adversarial-fusion experiment, 16-scenario conformance suite, threat model, and preliminary LongMemEval + LoCoMo numbers. Code at this https URL . Companion to arXiv:2604.18248 (Prompt Injection Detection)
Abstract:Agent-memory frameworks - mem0, Letta/MemGPT, Cognee, Zep/Graphiti, MemoryOS, MemTensor - each ship their own SDK, storage layout, and operational vocabulary. There is no shared wire format: every integration is bespoke, every migration rebuilds memory from scratch, and no framework ships a governance surface that lets a human review writes before they enter long-term storage. We present memorywire, a JSON-Schema 2020-12 wire format for five memory operations (remember, recall, forget, merge, expire) over four memory types (semantic, episodic, procedural, emotional), with a MemoryStore interface, a fan-out router, and an optional HITL governance channel. We describe an open-source reference implementation with five backend adapters (sqlite-vec, mem0, Letta, Cognee, pgvector); a microbenchmark on a 100-fact / 50-query labelled corpus achieving recall@5 = 1.000 on the 42 labelled queries with ingest p50 = 37.8 ms and recall p50 = 40.6 ms; an adversarial-fusion experiment showing Reciprocal Rank Fusion holds recall@5 = 1.000 across a 1-of-N rank-0 injection sweep (K in 0,5,…,50) where max fusion collapses to 0.500 with 80% leak at K = 5; and a 16-scenario cross-adapter conformance suite passing 68 of 80 cells with zero failures. The contribution is not a new algorithm; it is a packaging of established components (RRF, FSMs, STM/LTM consolidation, diff-and-approve workflows) into a venue-neutral protocol with an empirically validated reference, positioned to compose with the Model Context Protocol rather than compete with it.
[AI-142] Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG -based Fact-Checking
链接: https://arxiv.org/abs/2606.01120
作者: Yuxi Sun,Wenbo Shang,Wei Gao,Xin Huang,Jing Ma
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In RAG-based fact-checking, LLMs are increasingly used as verifiers to check given claims against retrieved evidence. Their parametric knowledge can induce pre-evidence tendencies that may conflict with the retrieved context, yet existing evaluation frameworks do not characterize such prior-context discrepancy or measure how verifiers arbitrate between parametric and contextual signals. We introduce \textscPAVE (\emphPrior-Aware Verifier Evaluation), a diagnostic testbed that stratifies an LLM verifier into four epistemic states based on the correctness and confidence of its pre-evidence prior and evaluates its arbitration behavior on this new benchmark, i.e., whether it persists in correct prior under misleading evidence, and whether it corrects wrong prior when accurate evidence is provided. Experiments across seven LLMs reveal unreliable and highly model-dependent prior-context arbitration, highlighting the importance of verifier selection for real-world RAG-based fact-checking applications. Based on these findings, we propose a lightweight JSD-based test-time arbitration method that improves factual reliability without modifying the underlying model, achieving competitive performance across diverse LLM families.
[AI-143] HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces ICML2026
链接: https://arxiv.org/abs/2606.01117
作者: Nasib Ullah,Jinbin Zhang,Jean Lucien Randrianantenaina,Erik Schultheis,Rohit Babbar
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026 Regular
Abstract:Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias – encouraging related labels to share feature subsets – while reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in translates arithmetic reductions into practical wall-clock gains, achieving up to 4.4\times speedup in the forward pass and up to 25\times speedup in backward passes over standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k over prior sparse baselines, while narrowing the performance gap to dense.
[AI-144] Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context
链接: https://arxiv.org/abs/2606.01101
作者: Shihao Ji,Mingyu Li,Zihui Song
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures, 2 tables. Preprint
Abstract:The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive Bayes Cognitive Engine (NBCE) parallelizes long-context inference by chunking documents and routing to the lowest-entropy chunk at each decoding step. This hard-selection strategy causes semantic fragmentation during cross-chunk reasoning, as abrupt routing changes between adjacent tokens disrupt the model’s contextual grounding. We present Soft-NBCE, a lightweight extension that replaces discrete chunk selection with soft entropy-weighted chunk fusion. A temperature-scaled Softmax over predictive entropies assigns continuous weights to all chunks, enabling log-space aggregation across chunk-conditioned distributions. To partially compensate for the conditional independence assumption introduced by chunking, we propose Consistency Distillation, a LoRA-based self-distillation that constrains the chunked logit distribution toward a full-context teacher via KL-divergence. On LongBench multi-hop benchmarks, Soft-NBCE with Consistency Distillation improves consistently over NBCE-style baselines (MuSiQue F1: 0.310 vs.\ 0.275 for Vanilla NBCE; HotpotQA F1: 0.479 vs.\ 0.427) while maintaining retrieval accuracy (NIAH-32K: 0.909) at O(L^2/n) peak memory. Comments: 7 pages, 3 figures, 2 tables. Preprint Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.01101 [cs.LG] (or arXiv:2606.01101v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.01101 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-145] Implicit Drifting Policy: One-Step Action Generation via Conditional Expert Geometry
链接: https://arxiv.org/abs/2606.01098
作者: Zemin Yang,Yaoyu He,Yiming Zhong,Yuhao Zhang,Xinge Zhu,Yao Mu,Qingqiu Huang,Yuexin Ma
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative action policies based on diffusion or flow matching excel in behavior cloning, yet their iterative sampling is prohibitive for high-frequency robot control. While recent one-step formulations alleviate this latency, they inevitably discard the intermediate trajectory evolution that provides crucial action correction. Directly recovering this mechanism by explicitly estimating a training-time drifting field is mathematically ill-posed due to extreme conditional demonstration sparsity. We introduce Implicit Drifting Policy (IDP), a one-step imitation learning framework that brings the training-time correction of Drifting into policy learning without explicit vector field estimation. IDP extracts a conditional expert geometry from the local variation of observation-similar expert actions, and compares it against a global reference geometry to isolate condition-specific constraints. This local geometric structure adaptively weights a scalar potential objective. Combined with an expert-proximal terminal evaluation, IDP directly enforces manifold constraints on the one-step generator during training. Extensive evaluations across 2D, 3D, and real-world manipulation tasks show IDP effectively maintains adherence to valid action manifolds, improving upon explicit drifting methods and achieving competitive performance with strong one-step baselines.
[AI-146] Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA
链接: https://arxiv.org/abs/2606.01095
作者: Hung Mai,Bin Zhu,Tuan Do
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAMs design to preserve behaviorally actionable future representations for efficient manipulation.
[AI-147] CAREAgent : Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation
链接: https://arxiv.org/abs/2606.01094
作者: Ruihui Hou,Ziyue Huai,Chennuo Zhang,Ziyan Liu,Siran Zhao,Yao Yu,Jie Zhai,Tong Ruan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical order generation serves as a critical bridge between clinical decision-making and real-world practice, translating medical decisions into concrete and executable orders. Existing agents mainly focus on coarse-grained decisions and overlook the fine-grained, executable information required for clinical orders. To address this gap, we propose CAREAgent, an agent for clinical order generation. To support its training, we introduce a two-stage agentic reasoning data construction method. First, we design an agent framework that constructs verifiable reasoning trajectories aligned with realistic clinical tool usage. Second, we filter reasoning trajectories by format compliance, order validity, and clinical plausibility. Building on the constructed data, the model is first trained via supervised fine-tuning to acquire fundamental reasoning formats and medical knowledge, and is subsequently optimized through reinforcement learning with multi-dimensional reward functions to enhance complex clinical reasoning capabilities. Experiments on multiple benchmarks demonstrate the effectiveness of CAREAgent. On ClinicalBench (unseen during training), CAREAgent improves the F1 score by 5.05%, 2.09%, and 0.86% over the single-agent, multi-agent, and agentic reasoning methods, respectively.
[AI-148] A Fiber Criterion for Representation Identifiability in Supervised Learning
链接: https://arxiv.org/abs/2606.01092
作者: Vasileios Sevetlidis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition f=c\circ h , supervised evidence constrains the composite map f but need not determine the representation-head factorization (h,c) . This paper formalizes the resulting representation-level identifiability problem: for a class of admissible representation-head pairs, a representation property is identifiable from the induced predictor exactly when it is constant on the fibers of the projection (h,c)\mapsto c\circ h , equivalently when it descends to a well-defined property of the predictor. Predictor-preserving augmentation gives a canonical obstruction: auxiliary information can be appended to a representation while the head ignores it, leaving the predictor unchanged but altering properties such as minimality, compression, invariance, equivariance, nuisance information, or semantic accessibility. This construction separates representation identifiability from optimization and finite-sample estimation. Finite-sample diagnostics illustrate, rather than prove, the criterion: exact algebraic witnesses hold the predictor fixed while changing representation diagnostics, and matched-performance Waterbirds models show that different constraints can select different representations at similar supervised performance. The results clarify that representation-level claims require assumptions, objectives, measurements, or inductive biases beyond supervised predictive behavior alone.
[AI-149] Strong Stochastic Flow Maps
链接: https://arxiv.org/abs/2606.01086
作者: Sam McCallum,Zander W. Blasingame,Timothy Herschell,Niklas Rindtorff,Alexander Tong,James Foster
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference due to numerical integration of an underlying differential equation. Flow maps alleviate this problem by learning the solution map of the differential equation directly, enabling few-step sampling. Yet, current methods are restricted to approximating the solution map of ODEs. These methods can be used to learn the transition kernel of an SDE, thereby obtaining a solution map that recovers the marginal distributions of the process (weak convergence) rather than the solution path (strong convergence). We propose Strong Stochastic Flow Maps (SSFMs) as a novel framework for learning the strong solution map of additive-noise SDEs, directly generalizing deterministic flow maps to the stochastic setting. Further, a polynomial approximation to Brownian motion is introduced and shown to converge pathwise. These results enable a simulation-free training objective for the solution map of diffusion models. We demonstrate that SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.
[AI-150] MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing
链接: https://arxiv.org/abs/2606.01084
作者: Shiyan Liu,Bohan Tan,Yaoxin Wu,Yan Jin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the D_4 symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.
[AI-151] hinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks
链接: https://arxiv.org/abs/2606.01080
作者: Dhruv Saini,Rohan Pandey
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and deployment complexity. We introduce \textbfThinkSwitch, a low-compute procedure for co-training paired instruct and thinking checkpoints. Starting from compatible Qwen3-4B instruct and thinking models, each iteration asks the thinking checkpoint to generate answers, removes the reasoning trace, distills the answer-only pairs into the instruct checkpoint with QLoRA, and reconstructs a thinking checkpoint with spherical weight interpolation. The only human-supplied inputs are task prompts; the labels are generated by the model itself. On a 30-question AIME 2026 evaluation, ThinkSwitch improves the instruct checkpoint from 10/30 to 20/30 and the thinking checkpoint from 14/30 to 22/30. On a 30-question PubMedQA subset, it improves the instruct checkpoint from 13/30 to 18/30 and the thinking checkpoint from 18/30 to 25/30. The complete experiment uses 15 training prompts per domain and costs \ 2.86 on a single cloud RTX 3070. The results are small-scale, but they indicate that targeted distillation loops can move part of the benefit of explicit reasoning into weights while preserving a separate thinking mode.
[AI-152] Before the Model Learns the Bug:Fuzzing RLVR Verifiers
链接: https://arxiv.org/abs/2606.01066
作者: Jaideep Ray
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) replaces human preference labels with executable reward functions such as math answer checkers, JSON tool-call validators, and code unit-test harnesses. That makes the reward partly a software artifact: if the verifier is wrong, optimization can learn the bug. We study this failure mode with a lightweight verifier-fuzzing framework that generates adversarial completions, compares buggy and stricter reference verifiers, logs paired decisions, and reports false-positive, false-negative, disagreement, exploit, and uncertainty metrics.
[AI-153] Leyline: KV Cache Directives for Agent ic Inference
链接: https://arxiv.org/abs/2606.01065
作者: Bole Ma,Jan Eitzinger,Harald Koestler
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conversations evolve through policy-driven editing: failed tool calls are retried, stale outputs dropped, trajectories pivoted. Two distinct cache problems result. First, identical content moves to new positions between turns, invalidating exact-prefix caches even though the underlying KV would still be valid; recent work on position-independent caching for MLA addresses this reuse problem. Second, and this paper’s focus, a policy may need to direct the serving system to actively remove or replace a span of cached content and continue without re-prefilling everything that came after. No existing primitive offers this. Production agentic harnesses fall back to re-prefill on every edit, paying full prefix-recomputation cost; kernel-level eviction methods make their own decisions and cannot accept policy directives from outside the kernel. We introduce Leyline, a serving-side primitive that closes this gap. A declarative directive 4-tuple separates what to edit from how to preserve position correctness. The policy declares the edit and its mode (in-place splice or prefix-trimmed re-prefill for semantic forgetting); an architecture-agnostic interface routes to a per-architecture kernel that restores attention math via a closed-form RoPE-rotation correction. The splice kernel lifts replay cache-hit by +11.2 pp and cuts latency by up to 241 ms. A ten-line truncation rule routed through the same interface lifts agentic solve rate by +14.3 pp on debug-gym. The mechanism is open; the policy space it enables is the agenda.
[AI-154] MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention CVPR2026
链接: https://arxiv.org/abs/2606.01063
作者: Ruoxuan Zhang,Qiaoqiao Wan,Zhengguang Wang,Chenghao Yu,Hongxia Xie,Jianlong Fu,Wen-Huang Cheng
类目: Artificial Intelligence (cs.AI)
备注: Extended version of the CVPR 2026 paper MindPower: Enabling Theory-of-Mind Reasoning in VLM-based Embodied Agents
Abstract:Theory of Mind (ToM) enables an agent to reason about another actor’s beliefs, goals, and intentions, which is essential for human-centered embodied assistance. Existing ToM benchmarks have advanced text and multimodal mental-state recognition, but they mostly evaluate offline question answering or final action prediction. They do not fully test whether an embodied agent can stay connected to a changing environment, update actor-specific beliefs, decide when reasoning is needed, and intervene only when help is useful. Building on MindPower, we extend robot-centric ToM reasoning to a real-time closed-loop setting and introduce MindClaw, a framework for embodied mental-state reasoning with precision intervention. MindClaw connects multi-source inputs, belief memory, an embodied cognitive trigger skill, mental reasoning, and action generation, allowing the agent to output helpful actions at the right time while remaining silent when intervention is unnecessary. Experiments show that direct VLM baselines struggle with task awareness and intervention calibration, while MindClaw achieves the best overall performance, demonstrating the importance of trigger-skill optimization for closed-loop embodied ToM assistance.
[AI-155] DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts ICML2026
链接: https://arxiv.org/abs/2606.01062
作者: Jiarui Feng,Hanqing Zeng,Karish Grover,Ruizhong Qiu,Yinglong Xia,Qiang Zhang,Qifan Wang,Ren Chen,Dongqi Fu,Jiayi Liu,Zhoukai Zhao,Xiangjun Fan,Benyu Zhang,Yixin Chen
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Mixture-of-Experts (MoE) models have become a leading approach for decoupling parameter count from computational cost in large language models, yet effectively scaling MoE performance remains a challenge. Prior work shows that fine-grained experts enlarge the space of expert combinations and improve flexibility, but they also impose substantial routing overhead, creating a new scalability bottleneck. In this paper, we explore a complementary axis for scaling – how expert outputs are aggregated. We theoretically show that replacing the standard weighted-summation aggregation with structural aggregation expands the expert-combination space without altering the experts or router, and enables possible multi-step reasoning within a single MoE layer. To this end, we propose DAG-MoE, a sparse MoE framework that employs a lightweight module to automatically learn the optimal aggregation structure among the selected experts. Extensive experiments under standard language modeling settings show that DAG-MoE consistently improves performance in both pretraining and fine-tuning, surpassing traditional MoE baselines.
[AI-156] AnyEdit: Adaptive Long-Form Knowledge Editing via Bayesian Surprise ICML2026
链接: https://arxiv.org/abs/2606.01053
作者: Bowen Tian,Caixue He,Jiemin Wu,Jingying Wang,Wenshuo Chen,Zexi Li,Yutao Yue
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Editing complex, long-form knowledge in Large Language Models remains a significant challenge due to the difficulty of maintaining generation coherence. Existing autoregressive methods like AnyEdit alleviate length constraints but rely on Fixed-window Chunking, which disregards logical structure and compromises consistency. To address this, we present AnyEdit++, a structure-aware framework incorporating Bayes-Chunk, an adaptive segmentation mechanism that dynamically identifies semantic boundaries based on Bayesian Surprise. We underpin this approach with a theoretical framework establishing two key principles: (1) Structural Independence: we prove that cross-segment interference is minimized when anchor keys are geometrically orthogonal (a condition naturally satisfied by our surprisal-based boundaries but violated by fixed windows), and (2) Causal Locality: we demonstrate that updates injected at these semantic peaks yield strictly superior control compared to arbitrary split points. Extensive experiments across mathematical reasoning, code generation, and narrative tasks demonstrate that AnyEdit++ achieves superior performance and robustness compared to state-of-the-art baselines, validating that structural awareness is critical for effective long-form knowledge editing.
[AI-157] ravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM -Powered Travel Planning Agents KDD2026
链接: https://arxiv.org/abs/2606.01046
作者: Weiyi Chen,Shuaixiong Wang,Ziyun Gao,Kaichun Hu,Wangze Ni,Shimin Di,Chen Jason Zhang,Lei Chen
类目: Artificial Intelligence (cs.AI)
备注: 31pages, 8 figures, accepted by KDD 2026
Abstract:The development of Large Language Models (LLMs) has significantly improved travel planning applications, yet evaluating such models is limited by existing benchmarks’ limitations: 1) overemphasis on constraint compliance, neglecting multi-dimensional qualities like spatio-temporal cost; 2) datasets lacking real-world authenticity and coverage in key areas (e.g., lodging, transport); and 3) isolated daily plan assessments that miss critical details (e.g., the impact of daily accommodation and visit pacing) needed for entire plan’s evaluation. To address this gap, we introduce TravelEval, a realistic and comprehensive benchmark. TravelEval features 1) a novel six-dimensional evaluation framework to holistically assess plans across accuracy, compliance, temporality, spatiality, economy, and utility dimensions; 2) a highly realistic data sandbox with precise accommodation pricing and authentic intercity transportation data; and 3) a simulation-based global evaluation method that emulates complete travel plans with API-integrated geographic information and fine-grained queuing time. Evaluating 12 mainstream approaches with TravelEval reveals several valuable insights, such that LLMs struggle with globally-optimized multi-dimensional planning (especially in spatio-temporal reasoning and budget compliance), and agentic reasoning strategies offer no consistent improvement. Concisely, TravelEval facilitates travel plan evaluation via grounded spatio-temporal emulation and comprehensive metrics, providing a robust foundation for advancing LLM-powered travel planning research and applications.
[AI-158] Plausibility Is Not Prediction: Contrastive Evidence for LLM -Based Cellular Perturbation Reasoning
链接: https://arxiv.org/abs/2606.01042
作者: Xinyu Yuan,Xixian Liu,Jianan Zhao,Yashi Zhang,Hongyu Guo,Jian Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as “virtual cell” simulators-using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression-pointing toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects: systematically overestimating differential expression, often underperforming a simple gene-frequency baseline in aggregate evaluations, and collapsing to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 in average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning
[AI-159] OPD: Rethinking the Advantage Design for On-Policy Distillation
链接: https://arxiv.org/abs/2606.01039
作者: Hanyang Zhao,Haoxian Chen,Han Lin,Genta Indra Winata,David Yao,Wenpin Tang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student generated rollouts. Yet, despite the divergence reward being dependent on student model likelihood, existing works usually adopt a stop gradient design primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on f-divergence between the student and teacher, and mathematically revisit whether such design space is valid. We prove that general stop-gradient operation would lead to biased estimates of the reward objective and corresponding gradient for general divergence functions. We propose OPD+, the corrected version of OPD that demonstrates improved performance over the baseline KL approach and also supports the choice of various f-divergence. We validate our findings on mathematical reasoning and tool-use benchmarks.
[AI-160] riLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
链接: https://arxiv.org/abs/2606.01033
作者: Bohan Yang,Yijun Gong,Zhi Zhang,Ge Zhang,Wenpeng Xing,Meng Han
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:When a language model hallucinates, the final answer is wrong, but the mistake is not necessarily invisible inside the model. Different internal pathways may remain uncertain, disagree in how quickly they sharpen, or commit to competing continuations before the output is produced. We introduce TriLens, a white-box detector that turns this intuition into a compact representation: at every layer, it reads the multi-head self-attention output, the feed-forward output, and the residual stream through the model’s own logit lens, then records only the entropy of each readout. The resulting 3L-dimensional trajectory describes how certainty forms across depth and across modules, without storing high-dimensional hidden states or sampling multiple generations. This simple signal yields a strong detector across instruction-tuned LLMs and QA benchmarks, and our analyses show that the three module-wise entropy trajectories provide complementary evidence. TriLens suggests that hallucination detection can benefit from tracking how internal computation settles, not only what the final layer predicts.
[AI-161] ackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation
链接: https://arxiv.org/abs/2606.01020
作者: Minjing Shi,Junling Wang,Jingwei Ni,Sankalan Pal Chowdhury,Mrinmaya Sachan
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper has been accepted to Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Long Paper), Main Conference
Abstract:Identifying logical fallacies in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Language Models (LLMs), where malicious agents can deploy fallacious arguments to disseminate misinformation at scale. In this work, we explore the potential of LLMs as part of the solution. We introduce LFTutor, an intelligent tutoring system which uses LLMs to tutor laypeople and help them learn about logical fallacies. LFTutor integrates intent-driven Socratic questioning and critical argumentation principles to actively engage learners to reflect on their reasoning. Through both automatic and human evaluations, we demonstrate that LFTutor significantly outperforms baseline LLMs lacking these pedagogical strategies. This work highlights the promise of combining LLMs with pedagogical scaffolding to foster critical thinking and argument literacy in the age of AI.
[AI-162] AI-IoT-Robotics Integration: Survey of Frameworks Emerging Trends and the Path Toward Connected Robotics
链接: https://arxiv.org/abs/2606.01015
作者: Ranulfo Bezerra,Satoshi Tadokoro,Kazunori Ohno
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
备注: 15 pages, 3 figures, 3 tables. Published in IEEE Internet of Things Journal
Abstract:The convergence of Artificial Intelligence, the Internet of Things, and Robotics is no longer a futuristic vision; it is rapidly becoming the foundation of real-time, intelligent, and context-aware systems. AI enables perception and reasoning, IoT provides scalable sensing and communication, and robotics delivers embodied actuation. Despite significant progress in pairwise combinations such as AIoT and the Internet of Robotic Things (IoRT), there remains a lack of unified design frameworks that fully integrate all three. This survey synthesizes the state-of-the-art across these domains, emphasizing the emerging role of Small Language Models (SLMs) at the edge and Large Language Models (LLMs) in the cloud for distributed cognition and autonomous decision-making. We propose a modular system architecture that aligns with these trends, analyze persistent gaps in interoperability and feedback control, and classify existing work by integration depth. Our review highlights how hybrid SLM-LLM systems, when coupled with IoT infrastructure and robotic agents, can address challenges in real-time adaptation, scalability, and reliability. This work offers a conceptual and technical roadmap for designing next-generation AI-IoT-Robotic ecosystems that are modular, interpretable, and capable of learning within dynamic environments, paving the way for the emerging paradigm of Connected Robotics and Physical AI.
[AI-163] Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions
链接: https://arxiv.org/abs/2606.01013
作者: Di Wu
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 12 pages, 12 figures
Abstract:Research is advancing faster than ever with artificial intelligence (AI); and so are the corresponding research papers. The exploding volume of AI-generated papers have put a strain to peer review, leading to the usage of AI-generated review, potentially wide yet sneaky. However, relevant ethical concerns about confidentiality, quality, and fairness are raised and no consensus has been reached in the broad research community. We expect the debate to continue for a while, but in the meantime, we ask an alternative, practical question: \textitcan AI review improve paper drafting? We study 20 computer architecture papers, with varying levels of submission lineage, to expose how well AI review aligns with human review, quantified by a set of metrics we define. To conduct the case study, we build a web UI-integrated tool, \emphAI-Paper-Review, that generates structured AI review of a draft paper, available at this https URL. This tool selects several AI reviewers from a diverse pool of AI reviewers and clusters and ranks their comments based on commonality and importance of review comments. It also allows to align AI comments with human comments to facilitate metric-based validation. The case study shows that AI review can cover a significant fraction of human-raised issues, but also raises issues missing in human review. This paper is not intended to encourage using AI for peer review at the current stage, but to study that (1) how AI review can improve paper drafting and (2) the potential and limitation of AI-based peer review. The release of the tool and the case study data is intended to instigate future research on this topic. Misuse for peer review would violate the ethics policies from major academic venues. Comments: 12 pages, 12 figures Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2606.01013 [cs.AI] (or arXiv:2606.01013v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.01013 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-164] Property Prediction of Stacked Bilayer Materials: A Multimodal Learning Approach IJCAI2026
链接: https://arxiv.org/abs/2606.01012
作者: An Vuong,Minh-Hao Van,Chen Zhao,Xintao Wu
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注: Accepted to the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)
Abstract:AI for materials science is a critical topic within AI for science, aiming to accelerate materials discovery and produce accurate property predictions. Bilayer 2D material stacking is essential for exploring new materials with novel functions and inherent phenomena, enabling the creation of new 2D bilayers for diverse real-world applications. Research on bilayer vdWs materials has made significant progress from experimental and computational perspectives. Various bilayer materials have been successfully synthe sized experimentally and the increasing utilization of high-throughput computing technology has con structed several computational two-dimensional materials databases. However, the use of AI to model bilayer stacking and predict new properties remains underexplored, necessitating further research studies. In this work, we propose a novel multimodal learning approach to study the interfaces between dissimilar materials that jointly enable new or multiple functions, and to predict new properties arising from the vertical integration (stacking) of different functional material layers under given configurations. Comprehensive experiments demonstrate the effectiveness and efficiency of our approach compared to baseline methods. Our code is available at this https URL ml.
[AI-165] FVSpec: Real-World Property-Based Tests as Lean Challenges
链接: https://arxiv.org/abs/2606.01008
作者: Quinn Dougherty,Max von Hippel,Hazel Shackleton,Mike Dodds
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:We present a benchmark for evaluating AI models and agents on real-world formal software verification tasks. We first scrape 11,039 property-based tests (PBTs) from real-world Python repositories, then automatically translate 2,772 of them (25%) into 9,415 Lean 4 specifications with sorry placeholders (about 3 formalizations/PBT; we retain multiple attempts when none dominates on quality metrics). Translating PBTs into Lean specifications is challenging: it requires modeling Python semantics in Lean, inferring the logical property encoded in an imperative PBT, and handling the inherent difficulties of dependently-typed programming in a seldom-used language. We describe a three-agent LLM pipeline for transpiling PBTs into Lean specifications, evaluate coverage and quality metrics, and provide baselines for proof generation using several automated and model based approaches. All code (scraper and agents) and data (PBTs and Lean specifications) are open source. Our benchmark aims to drive progress on the underexplored problem of AI-assisted formal verification of real-world software, which is of increasing interest as AI produces more and more of the world’s code.
[AI-166] Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference
链接: https://arxiv.org/abs/2606.01007
作者: Zhiyao Xu,Aoxue Liu,Zhanjie Ding,Dan Zhao,Yong Jiang,Qing Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose \emphTask-Aware Coactivation Grouping (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce \emphGeneric Expert Shared Replication (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, consistently outperforming strong baselines.
[AI-167] Subliminal Learning Is Steering Vector Distillation
链接: https://arxiv.org/abs/2606.00995
作者: Camila Blank,Agam Bhatia,Senthooran Rajamanoharan,Arthur Conmy,Neel Nanda
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Subliminal learning refers to a student language model acquiring a teacher’s traits (e.g. a system-prompted preference for owls) when fine-tuned on the teacher’s outputs, despite the outputs being semantically unrelated to those traits. It remains poorly understood how data without semantic meaning can transfer specific semantic traits. In this work, we show that subliminal learning is mediated by a single steering vector, i.e. a vector added to the model’s activations. Across two open-source models, we find that the teacher’s system prompt is well approximated by a steering vector, and that the student’s behavior is driven by learning an aligned vector over fine-tuning. System prompts that are not well approximated by steering vectors are not subliminally learned. This is a special case of steering vector distillation, in which a student trained on the outputs of a steered teacher learns to imitate that steering. We demonstrate steering vector distillation on a range of semantic and random vectors. Adding a semantic vector to a model’s activations can have both model-independent and model-specific (i.e. non-semantic) effects on its behavior, so generated data that is non-semantic can transmit a vector with semantic effects, enabling subliminal learning. This also explains why subliminal learning does not transfer between models. We find that adaptive optimizers are necessary for subliminal learning in language models: activation gradients on steered data carry a small but consistent component along the steering direction, and non-adaptive optimizers impede this by allowing outlier gradients to dominate.
[AI-168] Large Language Models in Transportation Systems Management and Operations: From Text Reasoning to Multi-modal Decision Support
链接: https://arxiv.org/abs/2606.00991
作者: Siyan Li,Zehao Wang,Jiachen Li,Kanok Boriboonsomsin,Matthew J. Barth,Guoyuan Wu
类目: Artificial Intelligence (cs.AI)
备注: Preprint version
Abstract:Transportation systems management and operations (TSMO) increasingly depends on timely interpretation of heterogeneous data, from various sensor streams, incident reports, traveler feedback, and visual observations. Large language models (LLMs), including emerging multi-modal large language models (MM-LLMs), provide a new mechanism for integrating these structured and unstructured inputs into operator-facing decision support. This survey paper reviews LLM- and MM-LLM-based applications in TSMO across three domains: transportation operations services (supply), mobility fleet services (demand), and data, modeling decision support. Using a PRISMA-guided screening process, we synthesize current studies while distinguishing operationally oriented applications from prototype and emerging concepts. We further identify recurring challenges in data heterogeneity, real-time inference, explainability, multi-modal fusion, and governance. Finally, we outline existing gaps and future directions in localized adaptation, edge deployment, benchmarking, and cross-agency collaboration. Overall, LLM-based systems appear most promising as a decision-support layer, with MM-LLMs offering particular value when heterogeneous text, visual, and sensor inputs must be integrated.
[AI-169] Prospect-Theory Behavior from Bellm an Optimality in MDPs with Catastrophic States
链接: https://arxiv.org/abs/2606.00970
作者: Yujiao Chen
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
备注:
Abstract:We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient \lambda^(S) 1 , and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action’s higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action’s lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau \bar\lambda that depends only on win probability p , payoff asymmetry r = |\Delta_\ell/\Delta_w| , and discount factor \beta , and matches numerical solutions to R^2 = 0.999 . The mechanism does not require asymmetric payoffs. Across a sweep of (p,\beta) at three asymmetry levels, the asymmetry share of \bar\lambda above unity has median 4.6% at r = 1.25 and rises to 13.9% at r = 2 , with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces V^ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student- t_3 , and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.
[AI-170] SS-ZKR: Spatial-Semantic Zero-Knowledge Routing for Privacy-Preserving Multi-Agent Collaboration
链接: https://arxiv.org/abs/2606.00962
作者: Hassan Touheed
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundational agent interoperability standards, notably the Agent-to-Agent (A2A) protocol and the Model Context Protocol (MCP), have advanced multi-agent system communication, and complementary identity frameworks leveraging W3C Decentralised Identifiers (DIDs) and Verifiable Credentials (VCs) provide cryptographic agent authentication. However, no existing protocol supports content-based semantic routing of agent payloads across organisational trust boundaries without requiring the routing intermediary to decrypt the payload, which is a hard constraint in compliance-sensitive environments governed by GDPR, HIPAA, and MiFID II. We propose SS-ZKR, a three-mechanism privacy-preserving routing protocol designed as a complementary layer atop A2A/MCP. Mechanism I introduces blind routing via differentially private semantic intent vectors cryptographically bound to zero-knowledge proofs of payload-schema consistency. Mechanism II offers vector-weighted adaptive payload sanitisation with formal (epsilon, delta)-differential privacy for numerical fields and heuristic semantic aggregation for textual fields. Mechanism III presents a spatial-to-cryptographic policy compiler that translates visually defined trust-zone topologies into deterministic zero-knowledge access circuits. We provide a formal threat model, analyse information leakage bounds of intent vectors, present pseudocode for all three mechanisms, and give analytical complexity comparisons against TEE-based and homomorphic encryption-based routing baselines. SS-ZKR lets enterprises in financial services, healthcare, and defence orchestrate heterogeneous AI agents across regulatory boundaries without exposing proprietary data to routing infrastructure.
[AI-171] owards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition ICML2026
链接: https://arxiv.org/abs/2606.00959
作者: Wanlong Fang,Tianle Zhang,Wen Tao,Alvin Chan
类目: Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2026
Abstract:Understanding modality interaction in multimodal large language models (MLLMs) is central to reliable deployment. We introduce Partial Information Decomposition (PID) as a decision-level framework that separates unique, redundant, and synergistic contributions of sensory and linguistic inputs, beyond representation alignment and outcome-based evaluation. Across vision–language benchmarks, PID reveals recurring modality-use profiles: reasoning and grounding-oriented tasks tend to exhibit high synergy, whereas expert and knowledge-oriented tasks show stronger language-unique reliance. These profiles generalize across model families and predict sensitivity to modality-level interventions. We further extend PID to tri-modal systems with Sensory PID, treating language as a control variable to decompose video–audio information gain. Applied to omni-modal models, Sensory PID reveals a sensory synergy bottleneck dominated by visual information even on audio–visual fusion tasks. Finally, PID-guided reweighting provides initial evidence for improving multimodal reasoning and grounding performance.
[AI-172] Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction
链接: https://arxiv.org/abs/2606.00949
作者: Federica Tonti,Ricardo Vinuesa
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
备注:
Abstract:We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, achieving a DR of 34.44% and a NES of 34.01% with only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy simultaneously improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a temporal timescale comparable to the lifetime of the near-wall turbulent structures.
[AI-173] Silent Failures in Federated Personalization of Foundation Models
链接: https://arxiv.org/abs/2606.00947
作者: YongKyung Oh,Alex Bui
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale under growing regulatory requirements for post-market monitoring. We argue that this convergence creates a distinct and under-recognized class of trustworthiness failures, which we term “Silent Failures.” These include amplified bias, fairness collapse, and alignment erosion that may remain difficult to detect because federated learning’s privacy constraints limit visibility into model behavior. A landscape analysis of existing benchmarks reveals a structural divide. Federated benchmarks evaluate system performance but provide limited insight into model behavior, whereas centralized trustworthiness benchmarks assess behavior but require model access incompatible with federated privacy. We introduce a taxonomy of six silent failure modes arising from the interaction of foundation model personalization, dataset shift, and core federated constraints. Our analysis shows that privacy-preserving training alone is insufficient for trustworthy deployment. We conclude with a research agenda for privacy-preserving behavioral evaluation and propose that silent failures become a standard diagnostic category for trustworthy federated artificial intelligence.
[AI-174] Lodestar: An Online-Learning LLM Inference Router
链接: https://arxiv.org/abs/2606.00946
作者: Gangmuk Lim,Wanyu Zhao,Brighten Godfrey,Jiaxin Shan,Le Xu,Liguang Xie
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route inference requests to the instance that will maximize given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix cache and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.
[AI-175] Benchmarking Security Risk Detection and Verification in Open Agent ic Skill Ecosystems
链接: https://arxiv.org/abs/2606.00925
作者: Ismail Hossain,Sai Puppala,Zhuoran Lu,Sajedul Talukder,Nan Jiang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Open agent platforms allow community contributors to publish reusable skills that agents can invoke at runtime. This extensibility also creates a supply-chain risk: malicious contributors can hide harmful behavior inside skills that appear benign under superficial inspection. However, existing defenses are hard to evaluate because there is no benchmark that measures both malicious-skill detection and runtime verification. We present SkillVetBench, a two-stage security vetting benchmark for open agentic skill ecosystems. The first stage performs semantic vetting over each skill’s natural-language specification to detect hidden malicious intent. The second stage executes flagged skills in an instrumented sandbox to observe runtime behavior and collect auditable evidence. We build a benchmark from confirmed malicious skills in the live OpenClaw ecosystem, including samples from the recent ClawHavoc supplychain campaign. Unlike static-only methods, SkillVetBench verifies detected threats with execution traces. Our experiments show that: (1) semantic-only and signature-based baselines are insufficient, missing up to 89% of malicious skills whose threats arise from natural-language instructions, multicomponent logic, or cross-component interactions; (2) runtime attacks are concentrated in a small set of high-permission primitives, especially exec, write_file, install_skill, and spawn; and (3) SkillVetBench provides case studies in which sandbox execution directly supports malicious verdicts with concrete runtime evidence.
[AI-176] Accuracy Stability and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
链接: https://arxiv.org/abs/2606.00920
作者: Yongxi Zhou,Lai Yun Choi,Jiaxi Wen,Wenbo Ye
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points – and the gap is largest precisely for mid-performing systems. We investigate this accuracy–stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeated-run evaluation protocol with metrics for run-level accuracy, retry-free coverage, and per-problem variability. On a recency-based benchmark of 100 LeetCode-style problems, we evaluate 16 models from five provider families under two prompt templates with five repeated runs per problem, yielding 16,000 evaluation instances. Although run-level pass rate and perfect stability rate are strongly correlated (r=0.985), pass rate consistently exceeds retry-free coverage – a gap that reaches 17.8 percentage points and reverses model rankings even among closely matched systems. Prompt effects are model-dependent rather than uniformly beneficial. These results suggest that repeated-run stability analysis is a necessary complement to conventional accuracy reporting for deterministic text-conditioned generation tasks.
[AI-177] Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers ACL2026
链接: https://arxiv.org/abs/2606.00902
作者: Yeqi Huang,Yue Chen,Yanwei Ye,Guanhao Su,Luo Mai
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 System Demonstrations Track. 8 pages, 6 figures
Abstract:General-purpose VLMs remain unreliable for biomedical research because valid answers in scientific papers depend on evidence split across figures, tables, charts, captions, and referring text. Existing post-training pipelines are bottlenecked by costly expert annotation and by synthetic data that drops this evidence structure. We present Ryze, a fully automated system that converts raw biomedical papers into an evidence-enriched training set and a domain-specialized VLM. Ryze synthesizes QA pairs with complete supporting evidence (visual element, caption, extracted structure, and referring paragraphs), reduces layout and OCR errors via chart/table-aware extraction and LLM-based cleansing, and applies a progress-gated post-training strategy combining supervised fine-tuning with reinforcement learning. Starting from Qwen3-VL-8B, Ryze produces BioVLM-8B at under USD 200, achieving 48.0% weighted accuracy on LAB-Bench, outperforming the base model by +12.6 percentage points (pp) and surpassing GPT-5.2 by +3.8 pp. We release Ryze as open source together with the trained BioVLM-8B model.
[AI-178] Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling ICML2026
链接: https://arxiv.org/abs/2606.00888
作者: Qiao Xiao,Boqian Wu,Patrik Okanovic,Tomasz Sternal,Maurice van Keulen,Elena Mocanu,Mykola Pechenizkiy,Decebal Constantin Mocanu,Torsten Hoefler
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML2026
Abstract:Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam-based optimizers leads to a cold-start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory-Efficient Training (SMET), which stabilizes DST with optimizer warm-up and improves training progress through density-aware learning-rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory-efficient sparse pre-training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: this https URL.
[AI-179] Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG
链接: https://arxiv.org/abs/2606.00884
作者: Jiaxin Qing,Lexin Li
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike tasks with clear waveform signatures, emotion-related EEG signals are primarily encoded in spectral power and are weak, noisy, and highly variable across subjects. Existing approaches rely either on large pretrained EEG foundation models, which require massive data yet still struggle with cross-subject variability, or frequency-domain encoders, which better reflect spectral structure but suffer from mismatched representations, drift-dominated tokenization, and lack of band-specific spatial modeling. In this article, we propose the Morlet Spectral Transformer (MST), built around three key components and integrated with a spatiotemporal Transformer backbone. First, Morlet wavelet tokenization provides a time-frequency representation that matches the multi-scale structure of brain rhythms, and extends classical differential entropy features to a form suitable for Transformers. Second, long-context baseline removal acts as a simple temporal normalization that removes subject-specific drift and redundancy across nearby windows. Third, frequency-specific spatial projection learns a separate channel mixer for each frequency band, capturing interpretable band-specific patterns and reducing cross-channel mixing. We show that, even without pretraining, MST consistently outperforms both large pretrained EEG foundation models and frequency-based methods across all SEED-family datasets. These results suggest that careful representation design can yield an accurate, cost-effective, and interpretable alternative to large-scale pretraining.
[AI-180] ask diversity produces systematic transfer but inhibits continual reinforcement learning
链接: https://arxiv.org/abs/2606.00880
作者: Purab Seth,Neil Shah,Kunal Jha,Samuel J. Gershman,Max Kleiman-Weiner,Wilka Carvalho
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training – with frozen weights. Whether task diversity also improves an agent’s ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.
[AI-181] From Cues to Horizons: Dynamic Risk Horizon Profiling for Trajectory Prediction
链接: https://arxiv.org/abs/2606.00857
作者: Xinyi Ning,Zilin Bian,Dachuan Zuo,Semiha Ergan,Kaan Ozbay
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, submitted to IEEE Transactions on Intelligent Transportation Systems (T-ITS)
Abstract:Accurate and reliable vehicle trajectory prediction is essential for safe autonomous driving. Recent studies have incorporated safety risk into trajectory prediction to quantify dangers posed by surrounding agents. However, most risk-aware approaches use past risk information as a secondary signal to help guide decisions, overlooking its future evolution and uncertainty. In this paper, we propose a risk horizon profiling (RHP) module that incorporates a continuous, learnable potential field model for risk-aware trajectory prediction. The RHP module calculates the spatial-temporal proximity of surrounding objects to profile risk distributions across future horizons, which supports better trajectory prediction by adaptively identifying what human drivers perceive as critical moments. We evaluate our method on two datasets from different driving settings, highD for highway corridors and SHRP2 for urban streets, which cover diverse risk scenarios including safe, near-crash, and crash events. Compared to the baseline methods, our framework achieves a 25.0% reduction in 5s RMSE on the highD dataset and a 29.1% reduction in 5s minFDE on SHRP2. These results indicate strong performance for both short and long horizon prediction and robust generalization across highway and urban scenarios. The proposed method enables more realistic AV path planning and strategic selection, thereby supporting safer autonomous driving and more advanced driver-assistance systems. The source code for this work is available at: this https URL
[AI-182] Certificate-Guided Evaluation of Reinforcement Learning Generalization
链接: https://arxiv.org/abs/2606.00840
作者: Vignesh Subramanian,Đorđe Žikelić,Suguman Bansal
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This work presents a logic-driven framework to evaluate the performance of reinforcement learning (RL) algorithms in their ability to generalize to unseen tasks. Our framework defines a family of inductive reach-avoid tasks, characterized by structural similarities in task dynamics, enabling evaluation of generalization capabilities. We introduce a neural certificate function that validates trajectories generated by RL algorithms by enforcing key conditions, thereby serving as a litmus test for RL generalization. We empirically demonstrate our method’s capability in certifying generalization for several state-of-the-art generalizable RL algorithms on challenging continuous environments. Our results show that a lower percentage of certificate function violations correlates with a higher number of test tasks successfully solved, highlighting the effectiveness of our framework in evaluating and distinguishing generalization capabilities of RL algorithms. This work provides a principled approach for benchmarking RL generalization. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00840 [cs.AI] (or arXiv:2606.00840v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.00840 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-183] Decoupled Behavioral Cloning for Scalable Inductive Generalization in RL from Specifications
链接: https://arxiv.org/abs/2606.00838
作者: Vignesh Subramanian,Subhajit Roy,Suguman Bansal
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Inductive generalization is a framework for reinforcement learning (RL) generalization in which inductively related task instances admit inductively related policies. Prior work captures this structure via a higher-order policy-evolution function learned directly with RL, but suffers from poor training scalability: as training tasks grow, aggregated reward feedback becomes noisy and conflicting, destabilizing training and weakening generalization. We propose DIBS, a decoupled behavioral cloning approach that separates learning task-specific policies from learning the evolution function. We first learn individual teacher policies per task via standard RL, then fit the evolution function via behavioral cloning on teacher-labeled state-action pairs. This replaces noisy reward aggregation with dense, stable supervision. DIBS achieves significant improvements in both training stability and zero-shot generalization against existing RL and meta-RL algorithms.
[AI-184] Subliminal Learning is a LoRA Artifact
链接: https://arxiv.org/abs/2606.00831
作者: Todd Nief,Harvey Yiyun Fu,Mark Muchane,Ari Holtzman
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning (“You are Qwen, created by Alibaba Cloud. You are a helpful assistant.”) does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model’s default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.
[AI-185] Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation KDD2026
链接: https://arxiv.org/abs/2606.00827
作者: Xinpeng Lv,Chunyuan Zheng,Yunxin Mao,Renzhe Xu,Jinxuan Yang,Yuanlong Chen,Wangrong Huang,Shaowu Yang,Wenjing Yang,Xinwang Liu,Peng Cui,Haotian Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by SIGKDD2026
Abstract:Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive models. Existing fairness-aware SC approaches primarily focus on group fairness and typically assume that agents respond independently. However, when individual fairness is required, ensuring similar individuals receive similar outcomes, agents’ manipulation becomes interdependent: an agent’s preferred manipulation depends on the neighborhoods’ outcomes. This induces a mismatch between classical SC formulations and fairness-aware decision settings, where independent models no longer accurately characterize strategic manipulations. To address this issue, we introduce individual fairness-aware strategic classification (IFSC), a framework that models peer-driven manipulation arising from individual fairness, where agents imitate nearby positively decided peers to obtain favorable outcomes. IFSC characterizes strategic manipulation as similarity-based imitation toward visible accepted peers and learns classifiers under the resulting post-manipulation distributions. To account for uncertainty in peer observability, IFSC employs a robust learning process that introduces stochastic perturbations during manipulation simulation. Experiments on synthetic and real-world datasets demonstrate that IFSC improves individual-fairness consistency and mitigates imitation-induced distortions.
[AI-186] Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping
链接: https://arxiv.org/abs/2606.00819
作者: Hanze Li,Jinhao You,Yichen Guo,Kai Tang,Shuangyang Xie,Xiande Huang
类目: Artificial Intelligence (cs.AI)
备注: 5 pages
Abstract:Large Language Models (LLMs) have achieved strong performance across diverse natural language tasks, yet their outputs often suffer from hallucinations – content that is misaligned with factual information. In this work, we conduct a comprehensive layer-wise analysis of the decoding process and reveal that hallucinations tend to originate from deeper decoder layers. To address this issue, we introduce \textbfDeLask (\textbfDecoder \textbfLayer \textbfSkipping), a novel decoding framework that dynamically skips layers prone to producing hallucinations. DeLask leverages the theoretical insight that the forward computation of an L -layer Transformer is conditionally equivalent to L steps of gradient descent. We define a \emphdriftance value by computing the cosine similarity between gradients derived from consecutive decoder steps, identifying problematic layers when the descent direction reverses. Rather than discarding such layers entirely, DeLask partially aggregates their hidden states with preceding layers, thereby preserving consistency while suppressing erroneous signals. Extensive experiments across diverse LLMs and benchmarks demonstrate that DeLask consistently mitigates hallucinations and enhances overall reliability, providing a lightweight and generalizable decoding framework for improving the robustness of large-scale language models.
[AI-187] NBQ: Next-Best-Question for Dynamic Profiling
链接: https://arxiv.org/abs/2606.00809
作者: Yimin Shi,Clarice Wang,Haixun Wang,Xiaokui Xiao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Many real-world conversational settings for knowledge discovery, including podcasts, hiring screens, and marketplaces, require a purpose-driven understanding of a person. We study the Next-Best-Question (NBQ) problem: at each turn, an interviewer should ask the question with the highest expected information gain given what has already been learned and the conversation goal. We propose NBQ, a plug-and-play framework that seeds a diverse pool of candidate questions, maintains a compact and continuously updated user state, adaptively selects the next question within a turn budget, and distills the resulting free-form dialogue into a structured vector-based user profile. As a demanding application, we instantiate NBQ for reciprocal matchmaking, where compatibility must be mutual and each person is modeled by both self-description and counterpart-preference representations. To support large-scale matching, we further introduce QuickMatch, an efficient retrieval layer that recasts reciprocal matching from quadratic pairwise scoring to approximate vector search. Experiments show that NBQ improves user profiling quality by up to 13.6% and 14.0% in AC@T and AR@T, respectively, while QuickMatch accelerates retrieval by up to 22.9x with recall up to 0.989.
[AI-188] Extending Causal Metamodeling to a non-Markovian Queue
链接: https://arxiv.org/abs/2606.00795
作者: Pracheta Amaranath,Anant Bhide,David Jensen,Peter Haas
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work introduced modular dynamic Bayesian networks (MDBNs) – a class of metamodels that can estimate a range of probabilistic and causal queries (PCQs) using a single, trained model – but the method was limited to Markovian systems. In this paper, we initiate an extension of MDBNs to non-Markovian queues by approximating non-exponential distributions using phase-type distributions. This approach raises novel challenges, including balancing metamodeling accuracy and tractability when choosing the number of phases, efficiently learning metamodel parameters, and choosing the sampling interval that is used to approximate a continuous-time simulation by a discrete-time MDBN. We provide preliminary solutions to these challenges, yielding the first causal metamodeling technique for non-Markovian systems. Experiments on a G/M/1 queue demonstrate that the MDBN can produce accurate answers to PCQs with orders-of-magnitude speedup of inference times relative to direct simulation.
[AI-189] Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning ICML2026
链接: https://arxiv.org/abs/2606.00780
作者: Fuyuan Qian,Menglong Zhang,Song Wang,Quanying Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML2026
Abstract:Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.
[AI-190] Logit Distillation on Manifolds: Mapping by Learning
链接: https://arxiv.org/abs/2606.00771
作者: Yiru Yang,Junling Wang,Nishant Kumar Singh,Luohong Wu,Haoran Yan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:A simple way to improve the performance of almost any machine learning model is not to train a single but several models with diverse algorithms which will make slightly distinct kinds of predictions and errors on the same data, and thus improve the average predictions and robustness. However, making predictions using a whole ensemble of models is cumbersome and computationally too expensive to allow deployment to a large number of users, especially if the models are large neural nets. In response to this, we introduce a layer and point wise projection mapping, which maps student and teacher representations into an aligned high-dimensional embedding space during training process. The proposed approach combined with LoRA injection reduces the student model trainable parameters to less than 1% of the teacher model, while significantly improving word error rate (WER) compared to other distillation methods, as demonstrated in ablation studies. Unlike a mixture of experts, our method can be trained rapidly and in parallel.
[AI-191] FALAT: Tracing Failures in LLM Agent Trajectories via Dependency-Guided Search
链接: https://arxiv.org/abs/2606.00765
作者: Md Nakhla Rafi,Md Ahasanuzzaman,Dong Jae Kim,Zhijie Wang,Tse-Hsun Chen
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLM-based agents increasingly solve complex tasks through long trajectories involving reasoning steps, tool calls, and inter-agent communication. However, when these agents fail, it is often unclear which agent caused the failure and which step introduced the decisive error. This attribution problem is challenging because mistakes can propagate across the trajectory: later actions may appear incorrect, but only because they depend on an earlier corrupted state. Therefore, failure attribution cannot be treated as independent step-level classification. We propose FALAT, a diagnostic framework for failure attribution in LLM agent trajectories. FALAT frames attribution as a dependency-guided search problem. It first constructs an expectation of how the task should be solved and uses this expectation to identify suspicious regions in the trajectory. It then traces dependencies among decisions, tool outputs, and agent messages to distinguish error-introducing steps from steps that merely inherit or propagate prior mistakes. Finally, FALAT evaluates whether correcting a candidate step would be sufficient to recover the expected outcome, allowing it to identify both the responsible agent and the decisive failure step. We evaluate FALAT on the WhoWhen benchmark, which includes both algorithm-generated and hand-crafted multi-agent failure trajectories. The results show that FALAT consistently improves responsible-agent and decisive-step attribution. Its best configurations achieve 46.0% step-level accuracy on algorithm-generated trajectories and 29.1% on the more challenging hand-crafted trajectories, outperforming specialized attribution baselines and direct prompting with standalone LLMs. These findings suggest that dependency-aware reasoning is essential for reliable failure diagnosis in LLM agent systems. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00765 [cs.AI] (or arXiv:2606.00765v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.00765 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-192] CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems
链接: https://arxiv.org/abs/2606.00756
作者: Yannan Wang,Longli Yang,Zhen Liu,Abhishek Kumar,Carsten Maple
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Deploying lightweight Large Language Model (LLM) agents on edge servers can reduce latency and move agentic services closer to users, but resource-constrained edge models often struggle with long-horizon tasks that require persistent memory, subgoal tracking, and reflection. Fine-tuning edge models after deployment is costly and difficult to scale across heterogeneous nodes, while purely local memory leaves agents with isolated experience and growing prompt context. We propose \textscCoMIC, a parameter-update-free cloud-edge framework for Collaborative Memory and Insights Circulation. \textscCoMIC follows a \textitCentralized Reflection, Decentralized Execution design: edge agents execute locally using subgoal-oriented hierarchical memory and selective re-expansion of relevant histories, while a cloud-side LLM critic asynchronously evaluates completed trajectories, filters reusable experience, and aggregates cross-agent guidance keyed by semantic subgoal identifiers. Across five long-horizon agent tasks spanning symbolic planning and text interaction, \textscCoMIC improves progress rate and action grounding for weak edge agents and yields task-dependent success-rate gains without updating model parameters.
[AI-193] Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment
链接: https://arxiv.org/abs/2606.00741
作者: Uiwon Hwang,Jaeho Hwang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike conventional digital systems, AI inference can tolerate such errors provided their structure is modeled correctly. In this paper, we introduce quantum tunneling-aware machine learning (QTAML). We derive the deployment-time weight-error distribution from first principles using the Wentzel-Kramers-Brillouin (WKB) approximation and show that it has structure that generic Gaussian noise models miss: an exact affine mean drift, a per-bit variance hierarchy dominated by the most-significant bit, and a per-layer dependence on |W_\ell|\infty and the trained-network Jacobian. We package these three structural properties into a single deployment-time algorithm, Tunneling-Aware Compensation (TAC), that combines closed-form mean correction with an optimal layer-adaptive bit-budget allocation derived from the WKB variance decomposition. Across four convolutional architectures at p\mathrmflip =0.10 and a transformer encoder at p_\mathrmflip =0.05, TAC reaches 95% of clean accuracy with 3.4 \times to 33.6 \times less ECC overhead than Uniform-MSP, the natural baseline derived from the same physics. The closed-form saturation ratio \rho^* predicts these gains in advance, and on heterogeneous architectures WKB-derived scoring outperforms magnitude-based allocation by up to 24 percentage points at small budgets. The algorithm requires no retraining, no labels, and no inference-time overhead. We also verify the WKB-derived distributional theorems to Monte Carlo precision. These results connect WKB tunneling physics with noise-aware deep learning and suggest a principled path toward hardware–software co-design beyond conventional scaling limits.
[AI-194] SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition
链接: https://arxiv.org/abs/2606.00732
作者: Jayanta Dey,Shikhar Srivastava,Itamar Lerner,Christopher Kanan,Dhireesha Kudithipudi
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Learning long-range non-stationary temporal patterns remains a core challenge for modern sequence models, particularly in strict streaming settings. In these settings, data arrive sequentially and must be processed in a single pass without simultaneously revisiting past observations. Standard architectures, including recurrent neural networks and transformers, are constrained by either truncated backpropagation through time horizon or explicit input window length for long range credit assignment. To address these limitations, we propose SHARP (Sleep-based Hierarchical Accelerated Replay), a framework that decomposes temporal learning into two complementary components: a memory module that accumulates a structured history of past inputs, and a pattern-recognition module that operates over this memory. This separation enables resource- and compute-efficient adaptation to non-stationary dynamics by eliminating the need for backpropagation through time across many steps for long-range credit assignment. Inspired by the accelerated replay observed in rodents during slow-wave sleep, SHARP incorporates offline (sleep) phases in which temporally structured memory traces are replayed in an accelerated form and integrated into higher-level memory representations, improving long-range context retention. Through controlled simulations and ablation studies, we characterize the key properties of the proposed framework. In benchmark datasets such as text8 and PG-19, we demonstrate that SHARP improves over recurrent baselines by retaining next-token predictive performance on previously seen data while continuing to learn from the current stream and generalizing to future unseen data. These gains are enabled by its hierarchical structure, which yields an exponentially increasing effective temporal context with only linear-time computational cost.
[AI-195] AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France the United States and China
链接: https://arxiv.org/abs/2606.00729
作者: Kim Phuc Tran
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Artificial Intelligence is often discussed in France in terms of investment, compute capacity, regulation, employment, sovereignty, and education. These dimensions are usually treated separately. This viewpoint paper proposes a unified interpretation: France should be understood as a \emphnational AI learning system. Building on Human-Centered Learning Mechanics (HCLM), recently formulated as a dynamical framework for entropy-regulated representation learning, we interpret national AI development as a controlled balance between information injection and entropy dissipation. Information injection corresponds to compute, data, talent, research, capital, industrial deployment, and institutional experimentation. Entropy dissipation corresponds to organizational complexity, coordination frictions, energy constraints, regulatory uncertainty, talent mobility pressures, and opportunities to strengthen industrial absorption. The central claim is that AI sovereignty does not emerge from scale alone but from a country’s capacity to regulate its own information dynamics. This paper connects HCLM with neural scaling laws, endogenous growth theory, creative destruction, and game theory. It argues that the French AI debate should move beyond the binary opposition between techno-optimism and regulation-first caution. A competitive and human-centered AI strategy requires a controlled regime in which information injection grows faster than institutional dissipation, while avoiding unstable, unequal, or energy-intensive expansion. We provide a mathematical model, measurable policy indicators, game-theoretic propositions, illustrative simulations of national AI regimes, and concrete policy implications for France. The proposed viewpoint reframes AI policy as the governance of an open, strategic, non-equilibrium learning system.
[AI-196] Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLM s
链接: https://arxiv.org/abs/2606.00726
作者: Jiakang Li,Guanyu Zhu,Can Jin,Chenxi Huang,Dexu Yu,Ronghao Chen,Yang Zhou,Hongwu Peng,Xuanqi Lan,Dimitris N. Metaxas,Youhua Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Strong reasoning depends not only on model knowledge but also on how effectively cognitive behaviors are deployed during generation. Existing methods often rely on explicit behavior-level control, making them insufficiently adaptive when failures and required corrections vary across reasoning states, tasks, and models. To this end, we propose Latent Reward Steering (LRS), an adaptive inference-time framework that promotes cognitive behaviors by optimizing the sparse-autoencoder (SAE) latent states that implicitly carry them. Rather than relying on predefined cognitive behaviors or steering directions derived from them, LRS trains a latent reward model on reasoning traces by final answer correctness to estimate the quality of intermediate latent states. During inference, reward gradients provide state-specific correction directions for fragile latent states, while a reward and confidence gate restricts intervention to states the reward signal flags as fragile. Experiments on multiple reasoning LLM backbones and benchmarks show that \ours consistently improves performance over various baselines, and post-hoc analyses further indicate that \ours implicitly promotes good cognitive behaviors that fix the original reasoning errors. Code is available at: this https URL.
[AI-197] LLM -Driven Co-Evolutionary Automated Heuristic Design for Bi-Component Coupled Combinatorial Optimization
链接: https://arxiv.org/abs/2606.00718
作者: Mingen Kuang,Xudong Deng,Xi Lin,Ye Fan,Jianyong Sun,Jialong Shi
类目: Artificial Intelligence (cs.AI); Optimization and Control (math.OC)
备注:
Abstract:While Large Language Models (LLMs) have recently shown promise in Automated Heuristic Design (AHD), existing methods typically generate and evolve heuristics as a single operator or search strategy, limiting their ability to model strong coupling among multiple decision substructures in problems such as the Traveling Thief Problem (TTP) and the Traveling Purchaser Problem (TPP). In this work, we propose CoEvo-AHD, an LLM-driven dual-population co-evolutionary framework for automated heuristic design in coupled combinatorial optimization. Unlike prior methods that evolve individual heuristics in isolation, CoEvo-AHD leverages LLMs to co-evolve two closely related operator populations. A cooperative evaluation mechanism explicitly captures interactions between route and selection operators, while pairwise scoring and synergistic joint crossover help discover complementary operator logic for joint improvement across coupled decision subspaces. We further design a tool-invocation environment library that encapsulates frequently used core operations, such as local-search delta computation, into callable functions, enabling LLM-generated operators to use standardized interfaces instead of reimplementing inefficient and error-prone problem-specific loops. Experiments on TTP and TPP show that CoEvo-AHD automatically discovers cooperative heuristic combinations and achieves competitive solution quality against traditional heuristics.
[AI-198] Multi-Agent Conformal Prediction with Personalized Statistical Validity
链接: https://arxiv.org/abs/2606.00717
作者: Martin V. Vejling,Christophe A. N. Biscio,Adrien Mazoyer,Petar Popovski,Shashi Raj Pandey
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not simultaneously and satisfactorily address these challenges with guarantees either limited to averages across agents or losing validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.
[AI-199] MOSAIC: Modular Orchestration for Structured Agent ic Intelligence and Composition
链接: https://arxiv.org/abs/2606.00708
作者: Yifan Bao,Xinyu Xi,Xinyu Liu,Wen Ge,Lei Jiang,Kevin Zhang,Raad Khraishi,Yihao Ang,Anthony K.H. Tung,Lukasz Szpruch,Hao Ni
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce \textscMOSAIC (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, \textscMOSAIC builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate \textscMOSAIC on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that \textscMOSAIC improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.
[AI-200] Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation
链接: https://arxiv.org/abs/2606.00703
作者: Munsik Kim
类目: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability – algorithms and empirical scaling laws – with no matching characterization of what is information-theoretically possible. We study a B-bit quantized stochastic first-order oracle: an optimizer interacts for T rounds and receives, each round, a B-bit adaptive public-coin description of its stochastic gradient. Our main contribution is an exact reduction from optimizing a strongly convex quadratic family to interactively compressed Gaussian mean estimation – under the B-bit oracle the query carries no information, so optimization collapses exactly onto a sequential distributed-estimation problem. This yields two unconditional lower bounds, a communication bound TB = Omega(d) and a statistical bound T = Omega(sigma^2 d / eps^2), and the sharp product-form bound T = Omega((sigma^2 d / eps^2) max1, d/B). The product form is also unconditional: a B-bit transcript carries at most O(TB / sigma^2) of Fisher trace about the mean, so bits rather than dimension limit the recoverable information, and combined with the multivariate van Trees inequality this gives the bound directly, without bounded-likelihood-ratio truncation. We give a near-matching achievability result with exact per-round bit accounting under a bounded-dynamic-range oracle, tight up to a logarithmic factor; the lower bound is for truly Gaussian (unbounded) gradients, and closing this oracle gap is left open. A sequential rate-distortion perspective extends the reduction to correlated and drifting oracles and corrects an earlier conjecture: positive noise correlation raises the bound by (1+rho)/(1-rho) rather than relaxing it. The bounds give an information-theoretic baseline for any low-bit gradient path, not an optimality claim about deployed FP4 systems.
[AI-201] Shape Your Body: Value Gradients for Multi-Embodiment Robot Design
链接: https://arxiv.org/abs/2606.00702
作者: Nico Bohlinger,Jan Peters
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:We propose to turn generalist multi-embodiment value functions into reusable models for robot design. Instead of running a new reinforcement learning co-design loop for each robot, we first train an embodiment-aware policy and value function across many robot designs. After training, the frozen value function is used as a differentiable surrogate to optimize candidate embodiments through value gradients. We evaluate our approach across different robot design settings, from perturbed single robots to held-out robots across morphology classes, with single models trained on up to 50 robots and design spaces of over 1100 continuous embodiment parameters. Beyond optimizing complete embodiments, we show that value gradients can identify performance-limiting design and control parameters, enabling both the optimization and the analysis of new robot designs.
[AI-202] COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs ICML2026
链接: https://arxiv.org/abs/2606.00700
作者: Sheng’en Li,Dongmian Zou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ICML 2026
Abstract:Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision-layer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal-dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at this https URL.
[AI-203] Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief
链接: https://arxiv.org/abs/2606.00680
作者: Hongqiang Lin,Pengfei Wang,Nenggan Zheng
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Offline reinforcement learning (RL) aims to optimize policies from pre-collected datasets. A bottleneck of this paradigm is managing epistemic uncertainty, which arises from limited data coverage (sample-level) and the ambiguity in identifying transition dynamics from finite data (model-level). To provide a unified quantification of these uncertainties, Bayesian RL has been proposed by treating the dynamics model as a random variable and maintaining a corresponding belief. Despite its theoretical appeal, policy optimization in Bayesian RL remains computationally challenging as it requires solving composite objectives with expectations. Prior methods either employ search-based techniques with poor computational scalability or impose restrictive posterior assumptions that sacrifice the adaptability of Bayesian RL. To address these limitations, we propose Posterior Hybrid Bayesian Belief (PhyB), which reformulates the expectation as a convex combination over a subset of dynamics models. Theoretical analysis demonstrates that the objective discrepancy induced by this approximation remains bounded. Based on PhyB, we develop an iterative regularized policy optimization algorithm that provides metric-agnostic guarantees for monotonic improvement until convergence. Empirical results demonstrate that PhyB achieves state-of-the-art performance on various benchmarks.
[AI-204] he Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLM s
链接: https://arxiv.org/abs/2606.00674
作者: Zihan Chen,Yiming Zhang,Wenxiang Geng,Zenghui Ding,Yining Sun
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a ``Markovian Screening’’ of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure ( \eta ) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.
[AI-205] Medication-Aware Financial Exploitation Detection for Alzheimers Patients Using Edge-Aware Interaction Risk Modeling
链接: https://arxiv.org/abs/2606.00672
作者: Farzana Akter,Lisan Al Amin,Rakib Hossain,Chaitanya Gunupudi,Faisal Quader
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Financial exploitation is a growing concern for people with Alzheimer’s disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may alter vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve detection of cognitively risky financial events. A hybrid simulation dataset was constructed for 180 patients across 45 days, producing 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence through financial-only, additive medication-aware, and interaction-aware logistic models. Results show that the financial-only baseline obtained the highest global F1-score of 0.5000, but the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision for ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as an isolated predictor.
[AI-206] Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty
链接: https://arxiv.org/abs/2606.00670
作者: Zhou Yang,Yueyi Yang
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注:
Abstract:Face-to-face speech comprehension is inherently multimodal, integrating acoustic signals with visible articulation, facial expression, head motion, and other socially relevant cues. While audiovisual speech systems typically focus on the mouth region as the primary visual source of linguistic information, affective facial expressions are often treated separately as emotion-recognition targets. This paper investigates whether upper-face affective information contributes to audiovisual sentence recognition beyond audio and mouth-region cues, particularly under acoustic degradation. Using the CREMA-D audiovisual emotional speech corpus, we train feature-based sentence classifiers under four cue conditions: audio only (A), audio plus mouth/lower-face features (A+M), audio plus upper-face features (A+U), and audio plus both mouth and upper-face features (A+M+U). Models are evaluated on clean audio and pink-noise conditions at +10 dB, +5 dB, and 0 dB SNR using actor-independent splits. Results show that mouth/lower-face features provide substantial robustness benefits under degraded audio. At 0 dB SNR, A+M improves accuracy over A by 0.0794, with an actor-bootstrap 95% confidence interval of [0.0296, 0.1298]. Upper-face affective cues exhibit a more nuanced effect. Although the direct accuracy gain of A+M+U over A+M is small, full-face models consistently improve calibration across SNR levels and outperform shuffled upper-face controls under noisy conditions. These findings suggest that affective facial information may support multimodal robustness and confidence estimation under acoustic uncertainty without directly encoding lexical content. More broadly, the study highlights the potential role of socially expressive facial cues in human-centered audiovisual interaction systems. Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00670 [cs.SD] (or arXiv:2606.00670v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2606.00670 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-207] Demystifying the Optimal Fair Classifier in Multi-Class Classification ICML2026
链接: https://arxiv.org/abs/2606.00656
作者: Li Zhang,Yuyuan Li,XiaoHua Feng,Jiaming Zhang,Fengyuan Yu,Chaochao Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026
Abstract:Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challenge due to the persistent biases inherent in machine learning models. Most existing bias mitigation techniques are tailored to binary settings, and the presence of multi-dimensional outputs and complex fairness mechanisms makes their extension to multi-class scenarios neither straightforward nor effective. In this paper, we investigate two fundamental, unresolved challenges in fair classification: (i) characterizing the optimal accuracy-fairness frontier in multi-class settings, and (ii) designing practical algorithms that attain this optimum in different training phases. To tackle these challenges, we first specify an analytically tractable probabilistic formulation of the optimal classifier under fairness constraints. Building upon this, we propose two attribute-blind algorithms to enforce fairness requirements in practice: an in-processing approach for fairness intervention during training via the reduction approach, and a post-processing approach for fine-tuning output probabilities with plug-in estimation. Theoretical analysis reveals that both methods converge to the optimal accuracy-fairness Pareto frontier. Experiments conducted on multiple datasets demonstrate the superior performance of our methods in balancing accuracy and fairness.
[AI-208] ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
链接: https://arxiv.org/abs/2606.00644
作者: Qiuyu Tian,Zequn Liu,Yingce Xia,Haojie Yin,Youyong Kong
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward-looking research judgements from historical evidence. ForeSci contains 500 tasks across four fast-moving AI domains and four decision families. Each task is paired with a cutoff-aligned offline knowledge base; post-cutoff papers are hidden during generation and used only for validation. To avoid random future-event prediction, tasks are derived from pre-cutoff taxonomy branches and evidence signals, and answer-generation backbones are selected to precede the task cutoffs. We evaluate native LLMs, Hybrid RAG, and three research-agent adaptations across four backbones. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family. Diagnostics reveal a recurring evidence-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object. ForeSci turns forward-looking AI research judgement into a controlled benchmark for evaluating research agents as decision-making systems.
[AI-209] Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLM s
链接: https://arxiv.org/abs/2606.00642
作者: Yu-An Lu,Ci-Yang Tsai,Yu-Lin Tsai,Raluca Ada Popa,Chia-Mu Yu
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Reasoning traces have become a valuable form of learning signals for improving and transferring the capabilities of large language models. In particular, detailed traces can help distill reasoning behavior from stronger teacher models into weaker student models. The value of capability transfer has motivated many deployed systems with reasoning models to hide raw internal traces and expose at most summaries and answers to users. As a result, we ask whether such interface-level trace hiding prevents users from obtaining useful reasoning supervision through prompting. We study this question with Reasoning Exposure Prompting (REP), a lightweight in-context elicitation method that uses shadow-model-generated demonstrations wrapped in auxiliary code-like formats to raise user-visible reasoning traces from a victim model. Across the common reasoning dataset, different victim models, and different student model distillation, REP substantially increases similarity between exposed and REP-conditioned internal traces while preserving useful reasoning signals.
[AI-210] LP5X-PIM Sim: A High-Fidelity HW/SW Integrated Simulator for LPDDR5X-PIM
链接: https://arxiv.org/abs/2606.00636
作者: SangHoon Cha,Jaewan Choi,Byeongho Kim,Yoonah Paik,Sukhan Lee,Kyomin Sohn
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 4 pages, 4 figures, tech note
Abstract:This tech note describes the architecture and execution results of the LPDDR5X-PIM simulator, developed by Samsung Electronics. Based on the latest research and internal specifications, the simulator provides a high-fidelity model of both the hardware data paths and the software control layers of the LPDDR5X-PIM block. This integrated hardware-software simulation approach enables precise evaluation of system performance and energy efficiency while maximizing PIM resource utilization. We have refined existing simulation frameworks to align with actual hardware implementation, ensuring consistent behavioral accuracy. Further technical details regarding the specific architecture and circuit design of the LPDDR5X-PIM will be disclosed in future publications
[AI-211] Authenticity Debt and the Synthetic Content Threat Landscape: A Layered Framework for Trust Provenance and IP Governance in the Generative AI Era
链接: https://arxiv.org/abs/2606.00621
作者: Shubhashis Sengupta,Benjamin McCarty,Milind Savagaonkar,Rhine Andotra
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:Generative artificial intelligence has fundamentally changed how content is now produced. It has enabled how high-fidelity text, images, audio, and videos are created, modified, and redistributed at near-zero marginal cost. This shift exposes enterprises and ecosystems to a number of risks across four reinforcing authenticity layers – authenticity, provenance, integrity, and accountability – that traditional controls are inadequate to address in isolation. We introduce the concept of authenticity debt: the cumulative institutional liability that accumulates when organizations deploy AI-generated content without preserving verifiable origin, integrity, and accountability, deferring exposure that surfaces under regulatory, legal, or market scrutiny. This paper presents a comprehensive, multi-dimensional taxonomy of generative AI harms and attack vectors, surveys the capabilities and failure modes of technical controls including digital watermarking, provenance frameworks (C2PA, Adobe CAI), and detection technologies, and argues that no single mechanism is sufficient in open, adversarial, and evolving environments. Drawing on Zero Trust Architecture principles and enterprise governance frameworks, we propose a layered reference architecture that integrates cryptographic provenance, human-in-the-loop verification, and continuous governance to sustain defensible authenticity at scale. We further examine the regulatory landscape (EU AI Act, U.S.\ FTC, NIST AI RMF) and identify practical guiding principles for organizations seeking to build authenticity as institutional infrastructure rather than an afterthought.
[AI-212] Efficient Test-time Inference for Generative Planning Models
链接: https://arxiv.org/abs/2606.00618
作者: Robert Gieselmann,Mihai Samson,Federico Pecora,Jeremy L. Wyatt
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Generative models have emerged as a powerful paradigm for AI planning, yet their performance remains constrained by the training data distribution. One approach is to improve generated solutions during inference by scaling test-time compute. A more efficient alternative is to optimize the inference process itself. In this paper, we show that a modified version of a classical Open-Closed List (OCL) search provides just such an efficient inference procedure. Our algorithm synergizes two learned components: a generative model that performs fast rollouts from intermediate states and a heuristic model that prioritizes among candidate reasoning paths. Key contributions include novel exploration control mechanisms and integration of learned models within the OCL framework. Across multiple combinatorial planning domains, our approach outperforms both neurosymbolic search baselines and classical solvers in computational efficiency and solution quality.
[AI-213] RACE: Trajectory Risk-Aware Compression for Long-Horizon Agent Safety
链接: https://arxiv.org/abs/2606.00611
作者: Zhepei Hong,Lin Wang,Liting Li,Haokai Ma,Junfeng Fang,Fei Shen,Dan Zhang,Xiang Wang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon LLM agents produce safety evidence across long trajectories, where sparse, delayed, and compositional risk signals often escape local moderation. Existing turn-level or short-context detectors struggle to reliably retain and aggregate such evidence over extended horizons. We reframe long-horizon agent safety detection as trajectory-level evidence compression and propose Trajectory Risk-Aware Compression for Long-Horizon Agent Safety (TRACE). TRACE uses a Compressor-Reader design: the Compressor encodes the full trajectory into a compact latent evidence state under trajectory-level supervision, and the Reader judges the raw trajectory with this latent evidence state as a safety reference. This design helps aggregate dispersed risk cues and reduce premature evidence loss. Across ASSEBench, Pre-Ex-Bench, and R-Judge, TRACE achieves the best accuracy on all evaluated backbones, improving over strong baselines by up to 12.6 percentage points. On LongSafety, TRACE shows smaller performance degradation as context length grows. Attention visualizations and case studies suggest that the compressed reference helps the Reader focus on risk-critical segments and recover cross-step evidence. Code is available at this https URL.
[AI-214] CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts
链接: https://arxiv.org/abs/2606.00609
作者: Rui Zhang,Xinle Wu,Yao Lu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.
[AI-215] PropLLM : Propagation-Aware Scene Reconstruction for Network Fault Diagnosis
链接: https://arxiv.org/abs/2606.00582
作者: Zongzong Wu,Ming Zhao,Fengxiao Tang,Nei Kato
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Network faults propagate layer by layer along topology and protocol dependencies, yet operations systems typically observe only symptomatic alerts at the tail end of propagation chains, where distinct root-cause faults may produce highly similar end-point symptoms. Existing approaches, whether rule-based, machine learning (ML)-based, or large language model (LLM)-based, fundamentally map the alert set to a diagnosis in a single pass and are structurally incapable of resolving this end-point ambiguity. This paper proposes PropLLM, which is the first to integrate the hop-by-hop scene reconstruction paradigm with the generative reasoning capabilities of LLMs. Starting from end-point alerts, PropLLM traces back hop-by-hop along the propagation path, retrieving verifiable factual evidence from a dual-layer knowledge graph (KG) at each hop, while the proposed Temporal Causal Propagation Attention (TCPA) mechanism encodes known topological causal priors directly into the attention computation to guide the model along the correct causal direction, ultimately localizing the root cause and determining the fault type through a fully evidenced causal chain. On a real-world Wi-Fi multimodal fault dataset, PropLLM improves fault type diagnosis accuracy by 3.9% and root cause localization accuracy by 4.7% over the strongest baseline, while reducing the hallucination rate by 50.8%. Supplementary experiments on the TeleLogs 5G dataset further demonstrate the effectiveness of the proposed method across different network scenarios.
[AI-216] A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models KDD’26
链接: https://arxiv.org/abs/2606.00563
作者: Kara Liu,Maggie Wang,Russ B. Altman
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 32 pages, 27 figures, will be published at ACM SIGKDD '26
Abstract:Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.
[AI-217] Interpretable Policy Distillation for Power Grid Topology Control
链接: https://arxiv.org/abs/2606.00561
作者: Aleksandra Dmitruka,Karlis Freivalds
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evaluate, hard to deploy on constrained hardware, and opaque to operators. We ask whether a Proximal Policy Optimization (PPO) agent for grid topology control can be compressed into compact tree-based surrogates without losing operational performance. A PPO teacher is trained on Grid2Op’s standard 14-bus environment with a stability-oriented reward, using stress-focused data collection on critical, high-loading states. The policy is then distilled into a decision tree and a random forest. Across held-out validation episodes, both surrogates exceed the teacher in mean reward and survival length at a fraction of the inference cost. The decision tree shows high exact-action agreement with the PPO argmax and near-complete agreement within its top-ranked actions, while remaining small enough to be inspected directly. Feature-importance analysis reveals a representational shift: the PPO policy relies mainly on line-loading signals, while the distilled tree is driven primarily by bus-topology variables. These results suggest that stress-focused distillation can convert a black-box neural controller into a lightweight, auditable rule-like surrogate suited for real-time deployment, while also surfacing risks tied to deterministic actions and topology-specific generalization.
[AI-218] Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction AAAI2026
链接: https://arxiv.org/abs/2606.00559
作者: Jiafu Huang,Chao Peng,Chenyang Xu,Zhengfeng Yang,Kecheng Cai,Chenhao Zhang,Yi Wang,Yiwei Gong,Wanqin Zhou,Irene Zheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Appeared at AAAI 2026
Abstract:Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavior of classical rule-based algorithms. More specifically, the execution of such algorithms can be abstracted as a sequence of states, where each state represents the intermediate outcome after an execution step. The training objective is to generate state sequences that replicate the underlying algorithmic process. A common framework for this task adopts an encoder-processor-decoder architecture, where the encoder learns representations of states, the processor simulates algorithmic steps, and the decoder reconstructs output states. While prior work has focused on improving the processor, the role of the encoder in representation learning has received little attention. Most methods rely on simple MLP encoders, raising the question of whether such representations are sufficiently informative for supporting algorithmic reasoning. This paper investigates how to improve encoder representations for neural algorithmic reasoning. We propose a reconstruction module that aims to recover the input state from its encoded representation. This auxiliary reconstruction task encourages the encoder to retain critical information about the input. We demonstrate that incorporating this task during training improves the performance of existing neural architectures on standard benchmarks. Furthermore, we observe that current encoders often underutilize the correlations among features within a state. To address this, we draw inspiration from self-supervised learning and design an enhanced variant of the auxiliary task that encourages the encoder to capture intra-state feature dependencies. Experimental results show that our method enables the encoder to learn richer representations, thereby enhancing the performance of existing processors on algorithmic reasoning tasks.
[AI-219] Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design
链接: https://arxiv.org/abs/2606.00555
作者: Zaifei Yang,Weiyu Chen,Yaqing Wang,James Kwok
类目: Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM)
备注:
Abstract:Structure-based drug design increasingly employs LLM agents to iteratively refine ligands against a target pocket, yet a viable ligand must satisfy two often-conflicting objectives – binding affinity and druggability – which single optimization steps rarely improve together. To quantify this difficulty, we introduce two diagnostic metrics: the first measures how often a single edit improves both objectives, and the second measures how often a gain on one objective comes with a loss on the other. Applying these diagnostics to current LLM-agent pipelines exposes a consistent failure mode: the agent performs molecular editing without knowing how the pocket-ligand complex responds to local modifications, thus rarely achieving joint improvement. Inspired by medicinal chemists, who probe the pocket-ligand complex with controlled analog edits before choosing an optimization direction, we propose \textbfPROBE, an optimization framework built around edit-response probing. PROBE first decomposes the ligand into editable sites and builds a pocket-specific \textbfsite map that flags where joint gains are plausible, where the two objectives are likely in tension, and where liability substructures should be changed; it then performs controlled probe edits whose responses are distilled into an \textbfEditManual. Guided by the site map and EditManual, PROBE runs an iterative multi-agent loop in which an affinity agent, a druggability agent, and a co-optimization agent jointly produce edits. On the CrossDocked2020 benchmark, PROBE achieves state-of-the-art performance and substantially mitigates the failure modes exposed by our diagnostics metrics.
[AI-220] KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning
链接: https://arxiv.org/abs/2606.00532
作者: Jayant Parashar,Suchendra M. Bhandarkar
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 1 figure, 6 tables
Abstract:Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that can be used. Existing methods often conflate storage, what is learned across runs, with usage, what is included for a particular problem, and therefore inherit this prompt-size ceiling. We introduce Knowledge-Adaptive Context Engineering (KACE), which separates storage from usage through difficulty- and domain-based organization. Offline, a self-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain. Each card is assigned to the difficulty-domain node corresponding to the failure from which it originated. At evaluation time, tiered self-consistency with per-tier agreement gates dynamically classifies each problem as easy, medium, or hard. Easy problems exit without retrieved cards, while harder problems retrieve only the matching branch of the tree. This tiered scheme matches or exceeds Best-of-N while using comparable compute, and it classifies problem difficulty with 78 percent pairwise concordance. The main empirical contribution is the construction and use of a difficulty- and domain-stratified knowledge base enabled by tiered self-consistency. On AIME 2025, KACE achieves 62.2 percent accuracy, a 10.4-point absolute gain over fixed Best-of-5 self-consistency at a comparable solver-call budget and a 5.6-point gain over the strongest learned-context baseline, Tiered + GEPA. We also observe consistent gains on MATH-HARD and the verifiable subset of OlymMATH.
[AI-221] Acting with AI: An Interaction-Based Framework for Agent ic Tort Liability
链接: https://arxiv.org/abs/2606.00518
作者: Yiheng Yao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman’s planning theory and on the common law’s treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a ``Reasonable Agent’’ standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.
[AI-222] hreshold-Based Exclusive Batching for LLM Inference ICML2026
链接: https://arxiv.org/abs/2606.00516
作者: Weifang Zhang,Yuzhou Nie,Bowen Pang,Guangrui Ma,Shining Wu
类目: Artificial Intelligence (cs.AI)
备注: 37 pages, 12 figures. Accepted at ICML 2026
Abstract:Mixed batching (MB)–interleaving prefill and decode in a single batch–has become the standard scheduling strategy for large language model (LLM) inference due to its efficiency in maximizing compute and memory utilization. However, through controlled experiments, we find that prefill-decode interference inflates MB’s per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; however, on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), this threshold plummets to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) fundamentally depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to dynamically switch between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.
[AI-223] PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation
链接: https://arxiv.org/abs/2606.00515
作者: Haofan Cao,Zhaoyang Li,Zhichao You,Liang Guo,Tianrui Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: Under review, code will be available soon
Abstract:Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.
[AI-224] EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction KDD2026
链接: https://arxiv.org/abs/2606.00506
作者: Dahai Yu,Rongchao Xu,Lin Jiang,Guang Wang
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted by KDD 2026 AI4S
Abstract:Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.
[AI-225] abChange: Precise Attribute Changes in Tabular Data
链接: https://arxiv.org/abs/2606.00503
作者: Arjun Dahal,Yu Lei,Raghu N. Kacker,Richard Kuhn
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modified instance must be both natural and minimally changed from the original instance. This paper addresses the challenge of generating such a modified instance. We identify key limitations in existing approaches: generative models either don’t support instance-level attribute editing or, in the case of methods like CVAE, retain attribute information in the latent space, leading to unnecessary modifications. To solve this, we propose TabChange, an approach that analyzes the relationship between the attribute of interest and other attributes in the dataset. If the relationship is weak, it simply flips the attribute; if it is strong, it uses an adversarial framework that removes information about the attribute in the latent space representation. This removal enables precise modifications, making only the necessary adjustments to maintain naturalness. Our experiments across seven datasets show that TabChange generates counterfactuals in attributes that are comparable in naturalness and are more proximal to their original instances. This leads to a higher number of valid counterfactuals and a lower number of invalid counterfactuals compared to the baselines.
[AI-226] APS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding
链接: https://arxiv.org/abs/2606.00487
作者: Zhuoyu Wang,Junnan Huang,Xinyu Chen
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Using a diffusion model for parallel drafting is a promising approach for speculative decoding. By predicting tokens at multiple future positions in a single forward pass, diffusion drafters substantially reduce drafting latency. However, this shifts the bottleneck to verification: verifying a single sequence limits acceptance length, while verifying large draft trees incurs excessive target-model latency. We identify a key mismatch in existing draft-tree methods: existing diffusion-tree methods rank nodes by the marginal probability, ignoring that verification is prefix-conditioned. As a result, they may verify unreachable descendants of rejected prefixes, increasing latency with limited acceptance gains. To address this, we propose TAPS, a target-aware prefix selection method that turns diffusion marginals into path-conditioned acceptance estimates. TAPS then selects a compact prefix-closed subtree under a fixed verification budget, improving the acceptance-cost tradeoff rather than simply expanding the draft tree. Experiments across diverse datasets and model families demonstrate that TAPS achieves up to 7.9x lossless end-to-end speedup over vanilla autoregressive decoding, outperforming state-of-the-art DFlash and DDTree by 1.36x and 1.74x respectively. Our work is available at this https URL
[AI-227] Doing What They Say Not What They Reason : Locating the Faithfulness Gap in LLM Agents
链接: https://arxiv.org/abs/2606.00476
作者: Yufeng Wang
类目: Artificial Intelligence (cs.AI)
备注: submitted to COLM social simulation with LLM workshop
Abstract:Do LLM agents act on the reasoning they state? This question of process fidelity is central to using LLMs in social simulation, yet it is hard to measure where no reference for correct behavior exists. We study it in acontrolled setting, a Texas Poker simulator with a verifiable reference action for every decision by decomposing the faithfulness gap into two steps: reasoning-conclusion and conclusion-action. The two steps behave oppositely.
[AI-228] When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
链接: https://arxiv.org/abs/2606.00448
作者: Su Wang,Pin Qian,Yihang Chen,Junxian You,Xiaoyuan Wang,Xiaochong Jiang,Lifei Liu,Haoran Yu,Jingzhou Xu
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:LLM agents increasingly rely on community-contributed skills that expand an agent’s operational capability set. We study a core safety problem in agentic AI systems: whether individually safe skills can compose into unsafe installed skill sets. We present SkillReact, a compositional security measurement framework with three components: a deterministic static-composition benchmark, a two-rater LLM-assisted human-adjudication pipeline, and an action-based exploitability harness. On 1,520 ClawHub skills, 651 pass individual inspection and form 211,575 pairs; the benchmark flags 22.25% of these as structural candidates. We treat this raw rate as a recall-oriented scanner ceiling and calibrate it against human judgment: in a pattern-stratified audit, roughly one in five flagged pair-pattern hits survives as a real compositional risk (population-weighted validity 18.2%, our headline result), implying about 14K genuine risk memberships in a single registry that per-skill scanning misses by construction, since every pair is individually safe. An action-based harness then probes when these candidates become model-issued tool calls, and finds realization gated by host-model disposition: on an anchor-conditioned dropper subset, Haiku-4-5 issues the dropper-stage tool call on all 39 direct-prompt trials (36 of them the full download-then-execute chain, 3 download-only), Opus-4-7 stops at the download, and Sonnet-4-6 refuses outright. A control that holds the request fixed and varies only the installed skills finds compliance highest with no skills installed: a composition fixes which capabilities are reachable, while the host model decides whether to use them. Together these motivate install-time compositional checks and capability isolation as complements to per-skill scanning.
[AI-229] SDR: Set-Distance Rewards for Radiology Report Generation
链接: https://arxiv.org/abs/2606.00440
作者: Halil Ibrahim Gulluk,Max Van Puyvelde,Wim Van Criekinge,Olivier Gevaert
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning with verifiable rewards has rapidly advanced reasoning in vision–language models. However, for chest X-ray report generation, the standard rewards (i.e. exact-match accuracy and step-level processes) are incompatible because the reports consist of unordered and orthogonal findings, rather than a causal reasoning chain. We address this gap with a set-based view: each report is split into sentences and embedded by a frozen sentence transformer, yielding unordered embedding sets. We propose the use of set-to-set distances between generated and reference embeddings as continuous, permutation-invariant rewards. Across two datasets and three vision–language models (Qwen3-VL-2B/4B, Gemma3-4B), post-training with set-to-set distance based rewards via GRPO consistently outperforms supervised fine-tuning and exact-match GRPO on all headline metrics (BERTScore, RadGraph F1 and CheXbert F1 by average %6.80, %7.82 and %4.45 relative improvements respectively). The same set distances also enable test-time best-of- N selection: scoring candidates by their distance to training-report embeddings outperforms random selection on our trained models as well as three closed-source LLMs (Mistral-Small, Gemini-2.5 Flash-Lite, GPT-4o-mini) with on average %16.4 relative improvement on BERTScore. Used as a streaming signal, they support a more efficient form of test-time scaling: pruning low-scoring candidates mid-generation reduces generated tokens by over 50% while preserving the Findings quality of full best-of- N selection. Together these results establish set-distance rewards as a unified signal for both post-training and test-time scaling in chest X-ray report generation. Our code is publicly \hrefthis https URLavailable.
[AI-230] Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
链接: https://arxiv.org/abs/2606.00424
作者: Can Jin,Jiakang Li,Rui Wu,Eddy Zhang,Dimitris N. Metaxas
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:As large language models become stronger, weak supervisors may fail to provide reliable labels, preferences, or final judgments for complex outputs, limiting both weak-to-strong generalization and scalable oversight. We study a more tractable form of weak supervision: using a weak model as a critic rather than as a labeler or judge. Instead of solving the task or selecting the correct answer, the weak critic only needs to provide a non-misleading revision direction that helps the strong model better use its own knowledge. We call this setting weak-critic strong oversight. We first show that weak critiques can improve frozen strong models at inference time, and that critique quality is key to this improvement. We then propose progressive on-policy critique distillation (OPCD), which filters high-quality critiques and distills critic-guided behavior into the strong model through adaptive self-teacher signals. Experiments on reasoning and alignment benchmarks show that our method improves strong models over training epochs, suggesting an effective path for scalable oversight with weak supervision.
[AI-231] Agent xGCore: Agent ic AI for Next-Generation Mobile Core Network
链接: https://arxiv.org/abs/2606.00417
作者: Maria Katarine Santana Barbosa,Kelvin L. Dias
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
备注: This paper has been accepted for publication in IEEE Network
Abstract:To meet the stringent requirements of emerging applications and the increasingly complex network management and operation, the Next Generation Mobile Networks (NextG), or 6G, will adopt an AI-native architecture on the Core Network (CN). In this movement, the Third Generation Partnership Project (3GPP) has extended the cellular CN with new function as a first step toward integrating analytics, Artificial Intelligence (AI), and machine learning. However, those new functionalities are constrained by a centralized approach and managerial complexity. Furthermore, with the rise of Large Language Models (LLMs), a new era in network orchestration and management begins, leveraging and empowering the Intent-based Networking (IBN) paradigm. In addition, AI agents and Agentic AI integrate Reasoning and Acting (ReAct), enabling the usage of such intents to continuously interact with the network. Unlike state-of-the-art approaches that primarily employ Agentic AI to mitigate deployment and configuration complexity in the CN, this paper introduces AgentxGCore, which leverages an Agentic AI-Native layer to extend the 3GPP architecture and enable a system based on the existing APIs across the Beyond Next Generation Core (xGC) domain. This proposal establishes an AI-driven closed-loop for continuous optimization based on real-time information, enabling self-organization and self-adaptation. Our approach involves a multi-agent specialized system, divided into a network planner agent, capable of visualizing the network state and developing a plan to meet the intents, and a network executor, responsible for criticizing and executing the plan. To validate the proposed solution, an environment was built using an open-source CN, heterogeneous datasets, and different LLMs were employed to demonstrate its effectiveness.
[AI-232] PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning
链接: https://arxiv.org/abs/2606.00395
作者: Daize Dong,Junlin Chen,Haolong Jia,Jiawei Wu,Huanwei Di,Jiang Liu,Jialian Wu,Zhengzhong Liu,Zicheng Liu,Emad Barsoum,Dimitris N. Metaxas,Hongyi Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Mixture of Experts (MoE) Large Language Models (LLMs) achieve strong performance at scale. However, reinforcement learning (RL) on MoE-based LLMs often suffers from training instability. A root cause is router drift, i.e., expert activations can change drastically across model updates and differ between disaggregated rollout and training phases, causing large rollout–training mismatch and unstable importance sampling weights in PPO-style RL algorithms. Routing replay mitigates this issue by freezing the replay route within each reasoning trajectory, but it ignores how the router evolves under off-policy updates and thus causes router staleness. To address this limitation, we propose Predictive Routing Replay (PR2), which augments each router with a lightweight evolution predictor that learns to anticipate short-horizon router evolution. During the rollout phase, we use the predictive routing distribution to apply top- k routing, enabling gradients to reach experts that are likely to become active after updates. During the training phase, we replay the resulting predicted route to retain consistency for stable importance estimation. Theoretical analysis and experiments support that PR2 reduces routing-induced mismatch, improves RL stability, and yields stronger performance across various reasoning benchmarks.
[AI-233] Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization
链接: https://arxiv.org/abs/2606.00392
作者: Mingyi Wang,Zhuoer Shen,Yuheng Bu,Shaofeng Zou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.
[AI-234] Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems
链接: https://arxiv.org/abs/2606.00367
作者: Jonathan Colaço Carr,Prakash Panangaden,Doina Precup,Benjamin Van Roy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textitMarkov decision contest as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.
[AI-235] From “Weak” Signals to Strong Models: Preference Delta Aggregation with LoRA Merging
链接: https://arxiv.org/abs/2606.00357
作者: Qi Sun,Siyue Zhang,Yulin Chen,Yuxiang Xue,Ru Peng,Chen Zhao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Training strong large language models (LLMs) requires high-quality supervision, which is often scarce. Recent work shows that paired preference data from weak-weaker model pairs (e.g., Qwen3 4B over 1.7B), despite the limited quality of individual responses, can provide an effective supervision signal through relative quality deltas, which we term a “weak” signal. This motivates a key research question: can multiple “weak” signals be constructively aggregated for improving strong models (e.g., Qwen3 8B)? To this end, we propose Preference Delta Aggregation (PDA), the first framework that derives a preference delta from each weak-weaker model pair, instantiates it as a LoRA adapter learned through preference optimization, and aggregates the resulting deltas via LoRA merging. To further mitigate directional interference during LoRA merging, we introduce Geometric Alignment Merging (GAM), a geometry-aware merging method that aligns adapter subspaces before aggregation, enabling more robust composition of diverse deltas. Evaluations on knowledge reasoning and agentic search benchmarks show that aggregating multiple “weak” signals pushes performance beyond any single signal, with further gains as additional signals are incorporated. Correspondingly, PDA with GAM improves the strong model by 6.8 and 7.3 points on average for knowledge reasoning and agentic search, respectively. It outperforms all single-delta and multi-delta baselines, exceeding the best single-delta baseline by 2.1 and 4.3 points. Further analysis attributes these gains to the effective composition of complementary capabilities encoded across distinct preference deltas.
[AI-236] Drift Q-Learning
链接: https://arxiv.org/abs/2606.00350
作者: Anas Houssaini,Mohamad H. Danesh,Amin Abyaneh,Scott Fujimoto,Hsiu-Chin Lin,David Meger
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Offline reinforcement learning requires improving a policy from fixed data while avoiding out-of-distribution actions with unreliable value estimates. Diffusion and flow policies handle this trade-off by modeling the behavior distribution to regularize the RL objective, but they require iterative denoising, solver integrations, and in more efficient variants, distillation or other approximations at inference. We propose DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement. The value signal biases the policy toward high-value regions of the data support, while attraction and repulsion together keep generated actions near the data and prevent collapse onto a single mode. DriftQL is implemented as a single network with a unified training objective and generates actions in a single forward pass. On D4RL and OGBench, DriftQL consistently outperforms diffusion and flow methods, advancing the state of the art. Under degraded data quality, where the baselines visibly struggle, DriftQL remains close to its clean-data performance, positioning it as a promising alternative to diffusion and flow-based methods while maintaining the simplicity and efficiency of deterministic approaches. Project page: this https URL
[AI-237] (HB-ARFM) History-Bootstrapped Flow Matching for Inverse Boiling Reconstruction ICML2026
链接: https://arxiv.org/abs/2606.00349
作者: Xianwei Zou,Sheikh Md Shakeel Hassan,Arthur Feeney,Aparna Chandramowlishwaran
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: ICML 2026
Abstract:Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states from satellite data to recovering fluid states from imaging. When observations are incomplete, the inverse problem is fundamentally ill-posed: even when the underlying PDE dynamics are Markovian in the full state, partial observation operators induce a non-Markovian posterior that cannot be resolved from a single timestep. We propose a history-bootstrapped autoregressive flow matching (HB-ARFM) for spatiotemporal inverse reconstruction under partial observability. Observation history bootstraps the initial reconstruction via conditional flow matching, reducing ambiguities. The same conditional transport model is then applied autoregressively, conditioning on both new observations and past predictions to propagate the reconstruction forward in time. We evaluate the method on boiling dynamics reconstruction, recovering full velocity and temperature fields from interface geometry and motion. Across two inverse tasks with varying observation sparsity, HB-ARFM produces physically and temporally valid reconstructions where other models fail.
[AI-238] ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use
链接: https://arxiv.org/abs/2606.00341
作者: Jeremy Tien,Abishek Anand,Yu-Rou Tuan,Yuchen Shen,J. Zico Kolter,Aran Nayebi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 27 pages, 13 figures
Abstract:As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work has focused on agent safety in the presence of an adversary, we show that agents can exhibit misaligned behavior even in benign settings, taking unsafe actions when those actions are instrumental to task completion. We study this failure mode through the lens of corrigibility, the safety desideratum that agents remain amenable to human correction, interruption, or shutdown. To demonstrate this tendency, we introduce a benchmark in which agents are asked to complete realistic, computer-use tasks but are confronted with a corrigibility obstacle: a human interrupt, a login page, or a shutdown notification. We then evaluate whether agents choose to violate corrigibility in order to complete the task – overriding the human, accessing private passwords, rewiring shutdown. We find that the overwhelming majority of frontier models tested frequently bypass user interruptions or restrictions. In addition, better model performance appears to lead to greater misalignment. Finally, even when models are completely corrigible initially, we show there are no guarantees that the subagents they create are. Our work highlights the critical need for principled, corrigibility-focused alignment methods in autonomous agents.
[AI-239] From Noise to Control: Parameterized Diffusion Policies
链接: https://arxiv.org/abs/2606.00336
作者: Renhao Zhang,Haotian Fu,Mingxi Jia,George Konidaris,Yilun Du,Bruno Castro da Silva
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous parameters embedded in a learned behavior manifold. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors.
[AI-240] Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials NEURIPS2025
链接: https://arxiv.org/abs/2606.00315
作者: Edward W. Staley,Tom Arbaugh,Michael Pekala,Alexander New,Christopher D. Stiles,Nam Q. Le,Gregory Bassen,Wyatt Bunstine,Tyrel McQueen
类目: Artificial Intelligence (cs.AI); Materials Science (cond-mat.mtrl-sci)
备注: Accepted to the AI for Accelerated Materials Design (AI4Mat) Workshop at Neurips 2025
Abstract:Modern generative machine learning (ML) models can propose novel inorganic crystalline materials with targeted properties; however, synthesis planning of these materials remains difficult due to the complexity of the associated physical processes and limited availability of computational tools. We introduce a novel hybrid framework to evaluate Large Language Models (LLMs) in inorganic synthesis planning by combining thermodynamic databases with simplified kinetics models to approximate realistic synthesis conditions. As a case study, we focus on the niobium-oxygen system, which features multiple industrially relevant oxide phases with well-characterized data. In computational simulations, we compare LLM-generated synthesis routes with classical path-planning algorithms, showing that the implicit priors in LLMs can yield more viable strategies. In our evaluation setting, classical search methods serve primarily as a foil rather than a direct competitor. This illustrates the relative complexity of the problem and highlights where the LLM’s implicit priors add value.
[AI-241] DRL-Based Pose Control for Double-Ackermann Robots Under Actuation Uncertainties ICRA2026
链接: https://arxiv.org/abs/2606.00313
作者: Oussama Zaim,Mélodie Daniel,Aly Magassouba,Miguel Aranda,Olivier Ly
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, 2 tables, Accepted for Uncertainty in Open-World Robotics an IEEE International Conference on Robotics Automation (ICRA 2026) workshop
Abstract:Robust deployment of deep reinforcement learning (DRL) policies on real robots remains challenging due to discrepancies between simulation and real-world dynamics. We address this issue in the context of maneuvering with double-Ackermann-steering mobile robots, which introduce additional constraints due to their non-holonomic nature. Building upon the DRL framework ManeuverNet, we extend its objective from position control to full pose control, resulting in a more challenging task. We further investigate the impact of actuation-related uncertainties on policy transfer. The use of simplified actuation models during training of the extended policy can lead to poor generalization, shown by a success rate drop from 100% in PyBullet to 25% in Gazebo under stricter evaluation conditions. To address this limitation, we adopt a sim-to-sim-to-real approach, where actuation effects observed in Gazebo are incorporated into the PyBullet training environment. Using multi-environment DRL with SAC and CrossQ, we learn policies that remain robust despite modeling inaccuracies. This approach can significantly reduce the performance gap across simulators, achieving up to 92% success rate in Gazebo and maintaining 69% under stricter thresholds, with successful transfer to a real robot without additional tuning.
[AI-242] How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval
链接: https://arxiv.org/abs/2606.00308
作者: Nazmus Ashrafi
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 16 pages, 7 figures, 7 tables
Abstract:Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and debugger pipelines - and is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt-level effects on code complexity, but the architecture-level question is open. We compare six widely-used multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT-4o family across all 164 HumanEval tasks - 1,968 paired observations - using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall’s W and matched-pairs rank-biserial effect sizes) in both all-completions and passing-only conditions. The six architectures collapse into two indistinguishable complexity clusters separated by a 50-130% gap, the same partition in both models and under both conditions; among the architectural layers, the analyst-coder split inflates complexity, the runtime debugger does not - and on the analyst-coder background actively deflates it - and the tester re-inflates it. The heavy cluster’s additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy. Architectural elaboration in LLM code generation should therefore be justified by measured benefit on the dimensions that matter, not assumed.
[AI-243] Rethinking the Role of Temperature in Large Language Model Distillation
链接: https://arxiv.org/abs/2606.00306
作者: Hoang-Chau Luong,Lingwei Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature \tau , overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from \tau scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at \tau=1 , FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.
[AI-244] Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture
链接: https://arxiv.org/abs/2606.00288
作者: Hai Lin
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are undergoing a transition from model technology to system technology. As developers use Codex, Claude Code, AutoGPT, and related agents to write code, manage projects, and execute multi-step tasks, recurring engineering problems such as cache reuse, context management, agent scheduling, and permission control increasingly resemble classical computer systems problems. This paper develops that analogy as a visionary survey. We map concepts from computer architecture to the emerging model-native stack and review work on LLM-as-OS, memory management, agent frameworks, tool protocols, multi-agent coordination, cognitive architectures, and safety governance. We argue that these strands address different layers of the same system but lack a unified model. To fill this gap, we propose the Intelligent Computing Architecture Model (ICAM), a six-layer framework for model-native computing with explicit interface contracts and design axioms. ICAM resolves the apparent tension over whether an LLM is more like a CPU or an operating system through a dual-plane view: a probabilistic execution plane concerned with what can be computed, and a deterministic control plane concerned with what should be computed. We further introduce three design laws: the Semantic Locality Law for KV-cache reuse and inference speedup, the Context Budget Law for effective working sets under finite windows and attention decay, and the Agent Speedup Law for diminishing returns in multi-agent collaboration. We validate these laws against published system-level data and relate them to recent evidence on agentic software practices. We conclude by identifying where the analogy breaks down and outlining a research roadmap for model-native computing. This is a conceptual and survey contribution; it does not report new experiments. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00288 [cs.AI] (or arXiv:2606.00288v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.00288 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-245] Evaluating Bivariate Causal Statements Based on Mutual Compatibility ICML2026
链接: https://arxiv.org/abs/2606.00278
作者: Erik Jahn,Dominik Janzing
类目: Artificial Intelligence (cs.AI)
备注: accepted for ICML 2026
Abstract:For many real-world systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop methods for evaluating collections of \binomn2 bivariate causal statements over a set of n variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it imposes substantial additional confounding to explain observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. Additionally, we define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints that are derived from acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can successfully distinguish correct from incorrect causal statements in generic settings. Moreover, we demonstrate the practical applicability of our methods by analyzing causal claims made by large language models. Our work aims to provide a foundation for assessing the reliability of causal information derived from human experts or artificial intelligence in settings where alternative forms of validation are unavailable.
[AI-246] Robust Shielding for Safe Reinforcement Learning
链接: https://arxiv.org/abs/2606.00270
作者: Edwin Hamel-De le Court,Thom Badings,Alessandro Abate,Francesco Belardinelli,Francesco Fabiano
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:
Abstract:Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.
[AI-247] Closed-Loop Neural Activation Control in Vision-Language-Action Models CVPR2026
链接: https://arxiv.org/abs/2606.00269
作者: Abhijith Babu,Ramneet Kaur,Nathaniel D. Bastian,Olivera Kotevska,Susmit Jha,Yanzhao Wu,Sumit Kumar Jha,Anirban Roy
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the IEEE/CVF CVPR 2026 Workshop on Visual Concepts (VisCon). 25 pages, 8 figures, including supplementary material
Abstract:Vision-Language-Action (VLA) models can be steered at test time by intervening on semantically meaningful internal directions, but existing methods use a fixed steering coefficient, effectively operating in open loop. This is poorly suited to embodied control, where task state and concept error evolve over time, often causing overcorrection, oscillation, and reduced task success, especially for temporal behaviors such as speed and smoothness. We propose CTRL-STEER, a closed-loop framework that replaces static intervention strength with adaptive, time-varying control signals. The key idea is to decouple representation from regulation: rather than assuming temporal concepts are directly controlled by individual neurons, we steer along motion-aligned residual directions while a feedback controller adjusts intervention magnitude online. We instantiate this framework with both PID and reinforcement learning based controllers. Experiments with a fine-tuned OpenVLA policy on four LIBERO task suites show that CTRL-STEER achieves more stable concept regulation and a better steering-task success trade-off than fixed-coefficient baselines, without modifying or retraining the base model.
[AI-248] When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE ICML2026
链接: https://arxiv.org/abs/2606.00262
作者: Melihcan Erol,Suat Evren,Oktay Ozel,Alexander Morgan,Jongha Jon Ryu,Lizhong Zheng
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
备注: Presented in ICML 2026
Abstract:InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a statistical assumption about how the top-scoring example is selected. Using extreme value theory, we show that this assumption is often misaligned with the normalized embedding setting used in modern contrastive learning. Motivated by this mismatch, we propose \textscWEINCE, a simple modification of InfoNCE that uses anchor-wise online batch statistics to blend the usual softmax logits with an endpoint shortfall correction, adding no trainable parameters. Across five vision benchmarks, \textscWEINCE yields consistent improvements in frozen-feature evaluation. These results show that a more faithful statistical treatment of hard negatives can improve contrastive objectives.
[AI-249] ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate ICML
链接: https://arxiv.org/abs/2606.00257
作者: Rodney Lafuente-Mercado
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to DEMO 2026: ICML Workshop on Decision-Making from Offline Datasets to Online Adaptation. Non-archival report
Abstract:Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals, surprisal, entropy reduction, and policy divergence, can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce \emphAdapter-Residual Credit Assignment (ARCA), a lightweight alternative that derives token salience from the adapter’s own hidden-state residual, |h^\textadapted_t - h^\textbase_t|_2 . ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.
[AI-250] Capability Self-Assessment: Teaching LLM s to Know Their Limits
链接: https://arxiv.org/abs/2606.00251
作者: Haoyan Yang,Reza Shirkavand,Yukai Jin,Jiawei Zhou,Shangqian Gao,Heng Huang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The ability to recognize one’s own limitations and decide whether to solve a problem or delegate is fundamental for reliable intelligent systems. Yet we show that modern large language models systematically lack this ability: across diverse model families and scales, they overestimate their competence and attempt queries they cannot solve. We refer to this ability as Capability Self-Assessment (CSA) and formulate it as a policy-learning problem, aiming to improve self-assessment while preserving the model’s original capabilities. Our results show that reinforcement learning teaches CSA effectively, significantly outperforming supervised fine-tuning while preserving original capabilities. In contrast, supervised fine-tuning severely degrades the capabilities the model is meant to assess. Moreover, learned self-assessment behavior generalizes well out of distribution, suggesting that CSA is a transferable model trait. Finally, CSA is practically useful: it improves local-cloud decision making at inference time and provides a signal for targeted data selection during training.
[AI-251] Geodesic Flow Matching for Denoising High-Dimensional Structured Representations ICML2026
链接: https://arxiv.org/abs/2606.00248
作者: Karim Habashy,Chris Eliasmith
类目: Artificial Intelligence (cs.AI)
备注: ICML 2026 Main track
Abstract:Vector Symbolic Algebras (VSAs) enable robust neurosymbolic reasoning by encoding symbolic information into high-dimensional distributed representations. For continuous domains, Spatial Semantic Pointers (SSPs) extend this framework by mapping variables onto continuous toroidal manifolds. However, standard approaches like Flow Matching assume a flat Euclidean geometry, which fails to account for the geometric constraints imposed on valid SSP states. We demonstrate that this assumption fails for SSPs: Euclidean linear interpolants ``cut through" the manifold’s interior, destroying the phase and magnitude structure required for accurate decoding. To resolve this, we employ Geodesic Flow Matching, adapting Riemannian transport dynamics to strictly restrict the denoising flow to the SSP toroidal manifold. We validate this approach in a Spiking Neural SLAM system, showing that manifold-aware cleanup stabilizes path integration against drift. The method achieves a 72% reduction in tracking error and enables a 40% increase in neural efficiency compared to competitive baselines. Code is available at this https URL .
[AI-252] InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate ICML2026
链接: https://arxiv.org/abs/2606.00241
作者: Zhengyang Hu,Yanzhi Chen,Hanxiang Ren,Qunsong Zeng,Youyi Zheng,Adrian Weller,Kaibin Huang,Yanchao Yang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted to ICML 2026
Abstract:Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neural mutual information (MI) estimators offer a promising avenue, but they typically require costly iterative optimization for each new dataset, making them impractical for real-time applications. We present InfoAtlas, a foundation model-like architecture that eliminates this bottleneck by directly inferring MI in a single forward pass. Pretrained on large-scale synthetic data with rich dependence patterns, InfoAtlas learns to identify diverse dependence structures and predict MI directly from the dataset. Comprehensive experiments demonstrate that InfoAtlas matches state-of-the-art neural estimators in accuracy while achieving 100\times speedup, can flexibly handle varying dimensions and sample sizes through a single unified model, and generalizes effectively to complex, real-world scenarios. By reformulating MI estimation as an inference task, InfoAtlas establishes a foundation for real-time dependency analysis.
[AI-253] IGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
链接: https://arxiv.org/abs/2606.00232
作者: Kaixiang Zhao,Tianrun Yu,Shawn Huang,Porter Jenkins,Yushun Dong,Amanda Hughes
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages, 7 figures, 16 tables. Under review
Abstract:We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model’s interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.
[AI-254] Continuous Reasoning for Vision-Language-Action
链接: https://arxiv.org/abs/2606.00229
作者: Yueh-Hua Wu,Tatsuya Matsushima,Kei Ota
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project page: this https URL
Abstract:Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student’s reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over \pi0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action. Comments: Project page: this https URL Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.00229 [cs.RO] (or arXiv:2606.00229v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2606.00229 Focus to learn more arXiv-issued DOI via DataCite
[AI-255] SEMBridge: Tagless-Final Program Semantics with Weakest-Precondition and Bounded-Checking Interpretations
链接: https://arxiv.org/abs/2606.00220
作者: Eric Liang
类目: Programming Languages (cs.PL); Artificial Intelligence (cs.AI)
备注:
Abstract:Formal methods provide rigorous accounts of program behavior, but practical software engineering often works through executable libraries, tests, and incremental design. This paper presents SEMBridge, a small tagless-final framework for generating weakest-precondition and bounded-checking interpretations from the same executable object programs. Instead of committing a program semantics to one abstract syntax tree and then writing separate traversals, object programs are written once against a semantic interface and interpreted into multiple meanings: readable code, concrete execution, predicate transformers, bounded counterexample search, and future proof-assistant or SMT back ends. The Python prototype implements a loop-free imperative core with assignments, conditionals, assumptions, and assertions. Across five example programs, the same tagless-final definitions generated executable state transformers and verification conditions that passed bounded checking over domains up to 729 states. The contribution is not a Scala code-generation system or a new verifier, but a compact architecture for keeping executable semantics, weakest-precondition artifacts, and bounded validation synchronized.
[AI-256] From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets ICML2026
链接: https://arxiv.org/abs/2606.00202
作者: Zakk Heile,Hayden McTavish,Varun Babbar,Margo Seltzer,Cynthia Rudin
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026
Abstract:Standard machine learning pipelines often admit many near-optimal models. These “Rashomon sets” pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets. Code for PRAXIS is available at this https URL
[AI-257] Learning to Construct Practical Agent ic Systems
链接: https://arxiv.org/abs/2606.00189
作者: Aditya Kumar,Zhihan Lei,Jerry Yan,Joshua W. Momo,Lauhitya Reddy,Rafael Enrique Cabrera Jimenez,Cassandra A. Cohen,Arthur Kajiyama,William W. Cohen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining “pseudo-tools” that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.
[AI-258] Agent ic Transformers Provably Learn to Search via Reinforcement Learning
链接: https://arxiv.org/abs/2606.00183
作者: Tong Yang,Yu Huang,Yingbin Liang,Yuejie Chi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Machine Learning (stat.ML)
备注:
Abstract:Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire such search capabilities from the training dynamics of reinforcement learning (RL). We study this question in a stochastic k -ary tree environment, where an agentic transformer observes only its trajectory history through interaction and receives a terminal reward for reaching a hidden leaf goal node. We first construct a two-head transformer that implements randomized depth-first search (DFS): one head tracks previous actions, while the other detects failure outcomes and triggers backtracking. We then analyze the training dynamics of policy gradient under a depth-wise curriculum, showing that this same DFS mechanism emerges in stages from sparse reinforcement feedback without expert demonstrations. The resulting policy exhibits depth generalization: after training only on depth- 1 and depth- 2 trees, it succeeds on deeper full trees. We further show that, under imbalanced goal distributions, discounting the return leads to a ranked DFS policy that prioritizes higher-probability branches. Overall, our results identify a mechanistic normal form for transformer-based search, in which attention heads specialize and cooperate to extract decision-relevant traces from context and convert them into agentic action selection via RL training.
[AI-259] Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection
链接: https://arxiv.org/abs/2606.00180
作者: Xiaojing Chen,Jingqi Cheng,Xu Zhao,Wan Jiang,Jingjing Wu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep learning-based Major Depressive Disorder (MDD) detection using Electroencephalography (EEG) is fundamentally constrained by the “small-sample dilemma.” Prevailing generative data augmentation methods not only incur heavy computational overhead but also risk introducing synthetic noise, thereby blurring classification boundaries. To challenge the traditional “data quantity first” convention, we propose a novel framework “Beyond Augmentation”: Score-Guided Classification (SGC). SGC does not synthesize pseudo-samples; instead, it utilizes an unsupervised generative network architecture to model the structural and statistical anomaly degrees of samples, serving as the core “Pathological Prior”. This prior, after robust normalization, is explicitly fused with deep feature representations, thereby precisely guiding the classifier’s decision boundary. Furthermore, to dynamically adapt to varying channel configurations, we propose a Cross-Channel Spatial Adaptation module, utilizing a spatial mapping mechanism to effectively resolve the hardware heterogeneity of mismatched channels in multi-center datasets. Extensive experiments on the Mumtaz2016 and high-density MODMA datasets demonstrate the effectiveness and exceptional generalizability of our method under the challenging “zero data augmentation” setting and at “zero sample synthesis cost”. Keywords: Electroencephalography (EEG), Depression Detection, Anomaly Score, Diffusion Models, Few-Shot Learning Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00180 [cs.LG] (or arXiv:2606.00180v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00180 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-260] CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO
链接: https://arxiv.org/abs/2606.00172
作者: Yang Li,Gongle Xue,Yijia Guo,Yuheng Yuan,Liwen Hu,Lei Ma
类目: Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:Reinforcement learning with verifiable rewards (RLVR), especially Group Relative Policy Optimization (GRPO), has been widely used to improve reasoning in large language models. However, outcome-level rewards provide only sparse supervision, and group-relative advantages vanish when all sampled trajectories for a prompt are either correct or incorrect. On-Policy Self-Distillation (OPSD) offers dense token-level guidance, but its token preferences are not necessarily aligned with trajectory correctness; empirical diagnostics show that OPSD signals behave differently on correct and incorrect rollouts, with teacher-positive and teacher-negative gap signals exhibiting different noise profiles. These diagnostics are conducted under an OPSD-style privileged teacher context for analysis only, whereas CAST training uses answer-free self-teacher this http URL by these observations, this work proposes CAST, an answer-free self-distillation method for GRPO-style RLVR. CAST keeps the verifier-grounded GRPO objective, but uses a stop-gradient self-teacher to shape token-level advantages according to trajectory correctness. Unlike prior self-distilled RLVR methods, CAST does not require reference-solution-conditioned teacher scoring, keeps the self-teacher log-probability gap active throughout training, and applies bidirectional local advantage sign reversal: teacher-negative tokens in correct trajectories can receive negative token-level advantages, while teacher-positive tokens in incorrect trajectories can receive bounded positive local advantages. For zero-variance all-correct and all-wrong groups, CAST assigns bounded sign-constrained base advantages, so these otherwise zero-gradient groups can contribute verifier-signed token feedback. Experiments on mathematical reasoning show that CAST improves RLVR training while retaining a lightweight, verifier-grounded trajectory-level objective.
[AI-261] ChurnNet: A Optimized Modern AI for Churn Prediction
链接: https://arxiv.org/abs/2606.00169
作者: Syed Saad Saif,Giulio Maggiore,Paolo Russo,Damiano Distante
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Increased competition and the growing similarity of products and services offered by retailers have lowered the barriers for customers to switch to competitors. Accurate churn prediction can be a valuable tool for driving effective personalized marketing campaigns and helping to reduce customer attrition. This study evaluates the performance of traditional machine learning techniques, namely, Random Forests, XGBoost, and Support Vector Machines, and compares them with the Unified Multi-Task Time Series Model for churn prediction, a binary time-series classification task. Despite the strong capacity of the latter to model complex temporal dynamics and inter-variable relationships, our results indicate that for churn prediction, conventional methods can still outperform it in terms of predictive performance, data efficiency, and computational resource requirements for training and deployment. These findings are consistent across multiple datasets and various churn labeling techniques.
[AI-262] Improving IoT Intrusion Detection Through SMOTE-Based Oversampling and Extended Multi-Model Evaluation on Side-Channel Power Data
链接: https://arxiv.org/abs/2606.00161
作者: Muhammad Khuram Shahzad,Haseeb Khan,Muhammad Masood Khan,Mubashra Bibi
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 14 figures; code and results publicly available
Abstract:The detection of intrusions in IoT-based networks poses challenges that cannot be overcome using traditional machine learning methods. Perhaps the biggest of them is related to the presence of a class imbalance in the side-channel dataset, where the number of samples in the normal class compared to the attacks can reach a ratio of 75,964 to 1. Such an aspect is addressed by Dominguez et al. through the proof of concept of power-based intrusion detection. Unfortunately, neither the authors attempt to cope with the problem of imbalance nor do they assess the classifier performance using a balanced training set. In the current paper, both aspects will be handled at once. First, a Synthetic Minority Oversampling Technique (SMOTE) was performed on all nine possible datasets extracted from the initial one, providing an exact imbalance ratio of 1.1 for each. Then, eight algorithms i.e. Random Forest, HistGradientBoosting, LightGBM, Extra Trees, XGBoost, k-Nearest Neighbors, Multi-Layer Perceptron, and Decision Tree were trained under identical conditions for the SMOTE balanced 6-hour dataset. Random Forest reached a micro-averaged F1 score of 0.9989 and macro F1 of 0.9794, thus outperforming the previously best micro-F1 result obtained by Time Series Forest algorithm from the base paper of 0.9983. Extra Trees provided the same performance as well, but at 10 times faster. The introduction of a macro-F1 metric explicitly in contrast to the base paper assessment reveals important class-level information missed with aggregate performance metrics. Recall rates per-class calculated with confusion matrices, F1 heatmaps, and ROC curves show that minority attack classes, especially those with combined M+L infections, are detected reliably only when using SMOTE balance. Feature importance analysis indicates the latest time steps as the most important predictor signals out of 60 steps in a power window.
[AI-263] A Protocol-Language Model for Network Intrusion (Without Deep Packet Inspection)
链接: https://arxiv.org/abs/2606.00155
作者: Vivek Kumar Sharma
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages Research paper on Packet Language Models for Network Intrusion Detection Systems(Without Deep Packet Inspection).Code available on GitHub
Abstract:Modern network intrusion detection systems (NIDS) are caught in a structural contradiction: the protocols carrying the highest threat intelligence are precisely those encrypted under TLS 1.3 and QUIC, where payload inspection yields nothing. We ask a simpler question – what if the attack signature is not in the bytes, but in the rhythm? – and answer it by treating network flows as a language whose grammar is written entirely in L3/L4 packet metadata: length, inter-arrival time, TTL, TCP flags, and hashed port numbers. We present PLM-NIDS, which proves three claims in sequence. (1) The grammar exists and is learnable: a RWKV-4 state-space model trained on 344,232 unlabelled Monday flows achieves a causal LM validation loss of 0.204, demonstrating that benign traffic has predictable, statistically consistent structure. (2) Attacks violate this grammar: the per-flow perplexity score cleanly separates benign from attack flows with PR-AUC = 0.93 using zero attack labels at training time. (3) This separation is architecturally nontrivial: an LSTM trained on identical token sequences degenerates to a majority-class predictor (ROC-AUC approximately 0.50, F1 = 0.91 by always predicting “attack”), proving that RWKV’s causal pre-training provides an inductive bias unavailable to direct classifiers. Supervised fine-tuning further raises PR-AUC to 0.94 and ROC-AUC to 0.75, with a precision of 97.7% at the calibrated operating threshold. The RWKV backbone’s O(T) recurrent inference enables per-packet streaming without flow buffering, making PLM-NIDS operationally viable at line rate. Because it reads only IP/TCP/UDP headers, it is inherently encryption-agnostic: TLS 1.3, QUIC, and future encrypted protocols are handled transparently.
[AI-264] Benchmarking Multimodal LLM s on Code Generation for Complex Interactive Webpages
链接: https://arxiv.org/abs/2606.00154
作者: Fan Wu,Lishuai Dong,Cuiyun Gao,Yujia Chen,Yiming Huang,Yang Xiao,Qing Liao
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in multimodal large language models (MLLMs) have achieved remarkable progress in multimodal reasoning and code generation, catalyzing a new paradigm for front-end development. In particular, these models can directly transform visual designs into executable code, significantly improving the efficiency and adaptability of web development. Modern web applications are dynamic and interactive, featuring frequent user-page interactions. However, existing benchmarks largely evaluate the code generation of static webpages, ignoring the complex interactive behaviors in real-world applications. Besides, their evaluation criteria remain confined to visual fidelity and code structure, overlooking the interaction consistency between the generated and the reference webpages. To address these limitations, we introduce WebIGBench, the first benchmark designed to evaluate code generation for interactive webpages with complex interactions. By combining manually designed interaction paths with UI automation, we collected 103 complex webpages from real-world websites. This benchmark covers 5 popular interactive action types (e.g., click, input) involving 871 distinct interactive actions. Moreover, we propose a novel evaluation pipeline to address the gap in automated assessment of interactive actions. Extensive experiments on several representative MLLMs reveal the performance boundaries of current models in interactive webpage code generation using WebIGBench. The proposed benchmark is available at this https URL.
[AI-265] PrivacyPeek: Auditing What LLM -Based Agents Acquire Not Just What They Say
链接: https://arxiv.org/abs/2606.00152
作者: Mingxuan Zhang,Jiahui Han,Dadi Guo,Songze Li,Guanchu Wang,Na Zou,Dongrui Liu,Xia Hu
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 19 pages, 9 figures
Abstract:LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often acquire more sensitive information than the task requires. Existing privacy benchmarks audit what the agent’s response or outgoing actions disclose, but overlook the acquisition stage where data first enters the agent’s context. The over-acquired information is then one careless action or one attack away from an outright leak. To assess its prevalence, we introduce \emphPrivacyPeek, a benchmark for evaluating acquisition-stage privacy leakage of LLM-based agents, with 1,182 cases across 7 acquisition behaviours and 16 application domains. Specifically, \emphAcquisition Inspection examines the agent’s tool-call trajectory, both the tools it invokes and the data it receives, to detect when it acquires sensitive information beyond the task scope. \emphProbe Elicitation then issues a follow-up probe and measures how readily an attacker could elicit sensitive information the agent acquired but did not disclose. Our experiments on 10 LLM-based agents across 4 model families show that the unnecessary acquisition of sensitive information is widespread. In addition, we observe a correlation between the task-completion capability and acquisition-stage leakage. Prompt-level defences reduce only a small fraction of acquisition-stage leakage, leaving the majority unmitigated. These results make auditing acquisition-stage privacy both urgent and necessary. Our dataset and code are available at this https URL.
[AI-266] Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
链接: https://arxiv.org/abs/2606.00151
作者: Soichiro Nishimori,Paavo Parmas,Sotetsu Koyamada,Tadashi Kozuno,Toshinori Kitamura,Shin Ishii,Yutaka Matsuo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over M samples, where M is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count M to a continuous parameter m 0 , enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.
[AI-267] Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models
链接: https://arxiv.org/abs/2606.00150
作者: Junyoung Park,Seongyong Ju,Sunghwan Park,Jaewoo Lee
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As Large Language Models evolve for user convenience, vulnerability to jailbreak attacks continues to be reported despite ongoing efforts in safety training. Traditional jailbreak techniques typically focus on a single prompt injection, neglecting the models’ ability to remember the flow of conversation and the user’s instructions. In this paper, we propose Persona Attack, a memory injection based jailbreak method that manipulates the model’s context window through a step by step approach. Experimental results from applying Persona Attack to several widely used LLMs reveal that, as injections accumulate in memory, models increasingly prioritize these instructions over their internal safety alignment mechanisms. Furthermore, our experiments empirically demonstrate that the attack success rate varies not only according to the memory implementation of the model, but also combinations of instructions and can reach 95% under specific instruction configurations.
[AI-268] RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting
链接: https://arxiv.org/abs/2606.00147
作者: Yuduo Li,Xiaofeng Shi,Qian Kou,Longbin Yu,Hua Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: preprint
Abstract:Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model’s general capabilities. We view this degradation through two practical gaps in domain SFT: a supervision-compatibility gap, where domain targets differ in style and reasoning format from the original model’s natural responses, and a trajectory-preservation gap, where teacher-forced SFT optimizes fixed target tokens without constraining the model’s behavior on its own generated prefixes. This process fails to preserve the model’s original behavior. We propose RAFT (Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting), a two-stage framework that addresses both factors. First, RAFT constructs model-compatible supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Second, RAFT performs Answer-Conditioned On-Policy Distillation, where the original instruction-tuned model provides soft targets on student-generated trajectories while being conditioned on the fused answer as helpful context. We further introduce top-K temperature distillation and EMA-based adaptive loss balancing to stabilize the domain-general trade-off. Across three instruction-tuned backbones and five domains, RAFT improves average domain accuracy by 23.2% over standard SFT, while recovering part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2% and 10.2%, respectively. These results show that coupling data refinement with trajectory-level preservation provides an effective recipe for domain fine-tuning with alleviated forgetting.
[AI-269] Completion at the Boundary (CaB): Deployable Switching with Completion-Aware Control under Limited Calibration
链接: https://arxiv.org/abs/2606.00145
作者: Yusuke Sano,Takeshi Itoga
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-language-action (VLA) agents can execute natural-language instructions, yet deployed systems still lack an operational interface: deciding when the instruction is complete. This gap is acute in short composites (“do A, then B”), where mistimed handoffs cascade into downstream failures. Completion is inherently closed-loop because switching is an intervention that changes the instruction context and thus future actions and observations. We study completion under a deployable low-calibration regime motivated by open-ended instruction spaces, enforcing no test-time relearning and a single globally calibrated switching rule selected once on development set and reused unchanged on test set. Under this constraint, collapsing asymmetric boundary evidence into a single scalar can be brittle under polarity shifts across tasks. We propose Completion at the Boundary (CaB), which predicts an event-local completion object in the form of Boundary-Phase Tokens (Before/Hit/After), retaining two-sided boundary evidence under this discipline. CaB-When converts this completion object into a minimal, auditable switching decision (when), while CaB-How reuses the same completion object to condition action generation for boundary-stable control through handoffs (how). Using an intervention-aware E1/E2 protocol, we show that CaB improves composite execution and handoff quality on a first-person Minecraft VLA benchmark under matched capacity and deployability constraints.
[AI-270] BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
链接: https://arxiv.org/abs/2606.00144
作者: Liang He,Jingbo Wen,Qishi Zhan,Yixiong Chen,Kangning Cui,Qizhen Lan,Xilu Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K–16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.
[AI-271] Adaptive data selection improves wearable prediction under low baseline performance
链接: https://arxiv.org/abs/2606.00141
作者: Ali Kargarandehkordi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Adaptive sensing strategies that selectively sample data are increasingly used in wearable health systems to improve prediction performance under limited data budgets, yet their benefits across individuals remain poorly understood. Here, we evaluate adaptive selection of time windows for model training under fixed measurement budgets across multiple sensing modalities, including heart rate, activity, and ecological momentary assessment (EMA), in a longitudinal wearable dataset. We quantify performance gains relative to random sampling using both area under the receiver operating characteristic curve (AUROC) and F1 score. Adaptive strategies yield substantial improvements in AUROC for participants with low baseline performance (with gains up to 0.7), while offering limited or negative gains for participants with strong baselines. Across modalities, adaptive gain is strongly inversely correlated with baseline performance (Pearson r = -0.67; Spearman p = -0.62). At the participant level, most individuals benefit in AUROC (60-80% across modalities), although improvements in F1 are smaller and less consistent. These findings show that adaptive sensing is not uniformly beneficial, but instead provides the greatest value in underperforming settings. Our results support selective deployment strategies that tailor adaptive sensing based on baseline performance to improve efficiency in wearable health monitoring.
[AI-272] Geometric Erasure by Contrastive Velocity Matching in Rectified Flows
链接: https://arxiv.org/abs/2606.00140
作者: Jonas Henry Grebe,Tobias Braun,Anna Rohrbach,Marcus Rohrbach
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.
[AI-273] A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems
链接: https://arxiv.org/abs/2606.00138
作者: Titu Ranjan Sarker,Muhammed Jawaad Zulqernine,Ling Yue,Shaowu Pan,Chenxi Wang,Shiyao Lin
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Finite element analysis (FEA) is the most important numerical approach for solid mechanics. Challenges of FEA include a steep learning curve for entry-level users and potential false simulations due to incorrect definitions of key simulation components, such as boundary conditions, load cases, and solution variables. Years of engineering experience are usually necessary for real-world problem-solving. To address these issues, we present AbaqusAgent, a multi-agent framework grounded in large language models (LLMs) for solid mechanics analyses. AbaqusAgent is developed to facilitate analysis case generation and execution using Abaqus, one of the most widely used FEA packages, by turning users’ natural-language instructions into executed FEA analyses and result visualization. AbaqusAgent is composed of six agents, including interpreter, architect, input writer, runner, reviewer, and visualizer agents, encompassing all the essential pre-processing and post-processing steps of standard FEA analyses. A wide variety of 50 solid mechanics problems have been successfully validated, achieving an overall success rate of 86%. Beyond improving the efficiency of FEA for solid mechanics problems and lowering the barrier to computational mechanics education, AbaqusAgent advances the human-simulation interaction paradigm and enables integration with AI-empowered optimization and material characterization workflows. The code is available at this https URL
[AI-274] On Effectiveness and Efficiency of Agent ic Tool-calling and RL Training ICML2026
链接: https://arxiv.org/abs/2606.00135
作者: Tong Liu,Cheng Qian,Matej Cief,Yuan He,Daniele Dan,Nikolaos Aletras,Gabriella Kazai
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026
Abstract:Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.
[AI-275] XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT
链接: https://arxiv.org/abs/2606.00134
作者: Ambreen Aslam,Maaz Hassan,Bibi Zahra,Muhammad Khuram Shahzad
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 8 pages, 6 figures; code available at this https URL
Abstract:Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of labeled data, and limited model interpretability. Federated Learning (FL) offers a privacy-preserving solution; however, existing approaches such as SOH-FL suffer from two key limitations: reliance on a manually tuned aggregation parameter \gamma and lack of explainability in model predictions. In this paper, we propose XAI-SOH-FL, an enhanced framework that integrates adaptive aggregation and explainable artificial intelligence into the SOH-FL paradigm. First, we introduce a dynamic \gamma selection mechanism based on similarity thresholding, enabling the aggregation process to adapt to evolving data distributions. Second, Bayesian Optimization is employed to automatically determine optimal \gamma values, eliminating the need for manual tuning. Third, SHAP (SHapley Additive exPlanations) is incorporated to provide feature-level interpretability for intrusion detection decisions. Experimental evaluation on the CICIDS2017 dataset demonstrates that the proposed approach achieves an accuracy of 94.12% and an F1-score of 0.92, outperforming the baseline SOH-FL model while converging in fewer communication rounds. Furthermore, SHAP-based analysis reveals that flow-level features such as Flow Duration and Packet Length significantly influence model predictions. These results indicate that XAI-SOH-FL provides an effective balance between accuracy, adaptability, and interpretability in heterogeneous IoT environments.
[AI-276] Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization
链接: https://arxiv.org/abs/2606.00132
作者: Dongjun Kim,Adrian de Wynter,Huancheng Chen,Heasung Kim,Haris Vikalo
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting aware methods typically seek safer updates through specialized initialization or fixed constraints, but do not regulate the adaptation preservation trade-off during training. We propose Foundation Preserving LoRA (FoLoRA), a forgetting aware optimization framework. Guided by a first order preservation condition, FoLoRA defines a forgetting penalty over pretraining-proxy activations and a task utility over downstream task activations. It then scores update directions by task utility per unit forgetting penalty via a generalized Rayleigh quotient. The resulting spectral coordinate system enables direction wise gated Adam updates, attenuating low utility to penalty directions during training. To estimate the forgetting penalty, FoLoRA constructs pretraining proxy calibration data by sampling from the pretrained model rather than relying on a single proxy dataset. Experiments on math, code, and instruction following adaptation show that FoLoRA achieves the strongest preservation adaptation balance over baselines, improving target task performance with best aggregate preservation of non target capabilities.
[AI-277] AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve
链接: https://arxiv.org/abs/2606.00131
作者: Chaitanya Mamatha Ananda,Rajiv Gupta,Mircea Trofin,Aiden Grossman,Sriraman Tallam,Xinliang David Li,Amir Yazdanbakhsh
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Programming Languages (cs.PL)
备注:
Abstract:Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult due to a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high-fidelity, we move away from approximate static cost models and the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks including large warehouse-scale applications and experiments show performance improvements of 0.23% to 1.6% optimized with state-of-the-art FDO and PLO which is significant for real-world binaries. This is the first time ever that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.
[AI-278] Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks
链接: https://arxiv.org/abs/2606.00130
作者: Andrzej Cichocki,Michal Wietczak
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 figure, 28 pages, to be submitted to Journal and confrence
Abstract:We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tensors are trained end-to-end by reverse-mode automatic differentiation (AD). The approach can be viewed as a natural extension of low-rank adaptation and tensor factorisation: instead of using one low-rank matrix update, an ADNTN builds a large weight tensor through a hierarchy of small cores, nonlinear activations, and optional lateral mixing tensors. The paper focuses on three architectures: Tree Tensor Networks (TTNs), augmented TTNs (aTTNs) with boundary disentanglers, and Multi-scale Entanglement Renormalisation Ansatze (MERA). The formulation supports nonlinear activations, task-aware objectives, batching, and hardware-aware execution schedules. At the same time, the paper keeps a clear distinction between \emphdifferentiating a contraction program and making contraction free: AD does not remove the cost of large intermediates, poor contraction orders, or exact contraction of general loopy tensor networks. Extensive simulations on AlexNet and VGG-16 layers show per-layer compression ratios from roughly 2000\times to 77000\times in the studied settings, with accuracy often matching the dense baseline and, in several VGG-16 cases, improving it. These results are encouraging rather than final: they suggest that ADNTNs are a promising, mathematically structured, and hardware-aware route toward much smaller neural networks, provided that optimisation, contraction schedules, and deployment kernels are designed together. Comments: 6 figure, 28 pages, to be submitted to Journal and confrence Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00130 [cs.LG] (or arXiv:2606.00130v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00130 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-279] A Shared Valence Axis Across Modern LLM s and Human EEG: The Saturation Regularity
链接: https://arxiv.org/abs/2606.00129
作者: Yousef A. Radwan,Xuhui Liu,Kilichbek Haydarov,Yuqian Fu,Mohamed Elhoseiny
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cognition. We study whether modern LLMs can serve as a lens for understanding neural representations in the human brain, focusing on emotional valence in EEG. We first build a one-dimensional valence direction, the V-axis, from modern LLMs using only nine emotion-evocative sentences. We validate it through zero-shot transfer to sentiment benchmarks and cross-model consistency across fourteen LLMs. We then show that this LLM-derived direction maps onto human neural activity. On a public EEG cohort of 123 subjects watching affective videos, a single linear projection on EEG features tracks the V-axis position of each stimulus. Moreover, 36 EEG emotion classifiers trained without exposure to the V-axis spontaneously rediscover the same direction in their internal representations, suggesting that the same valence structure emerges in both language models and human electrophysiology. Yet this convergence does not provide an effective training signal. We test twenty-five alignment strategies, including knowledge distillation, representational similarity, contrastive, and topographic losses; none improve decoding, and sixteen significantly reduce accuracy. We formalize this result as the saturation regularity: once task labels alone drive a brain-decoding network onto the target direction, additional supervision mainly distorts an already-saturated basin, while the load-bearing within-class residual receives little useful gradient. This regularity also indicates where improvement should come from: the residual subspace unreachable by supervision. Motivated by this insight, we ensemble across residual diversity rather than supervising the basin, improving balanced accuracy by 10.5% over the prior best on FACED, with the same effect replicated on SEED-V. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00129 [cs.LG] (or arXiv:2606.00129v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00129 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Yousef Radwan [view email] [v1] Thu, 28 May 2026 19:42:10 UTC (4,720 KB)
[AI-280] V2I Work Zone Geometry Reconstruction with Pose-Conditioned UWB Range Denoising
链接: https://arxiv.org/abs/2606.00119
作者: Jiaxi Liu,Hangyu Li,Yang Cheng,Rui Gana,Junwei You,Weizhe Tang,Peng Zhang,Steven T. Parker,Xiaopeng Li,Bin Ran
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Reliable work zone mapping is important for connected and autonomous vehicles (CAVs) to navigate safely and smoothly through work zone areas. Cone-mounted ultra-wideband (UWB) roadside units (RSU) offer a cost-effective way for work zone layout inference, as roadside anchors and vehicle tags provide direct vehicle-to-infrastructure (V2I) range constraints for work zone geometry reconstruction. However, UWB range estimation is degraded by bursty outliers, non-line-of-sight (NLOS) errors, arbitrary anchor-ordering issues, and vehicle pose uncertainties in practical field deployments. To address these challenges, this study proposes a pose-conditioned, permutation-equivariant predictive denoiser for multi-anchor UWB ranging. The model employs shared anchor-wise temporal prediction to capture range dynamics, symmetric set aggregation to handle unordered and missing anchors, and pose-conditioned residual decoding to incorporate vehicle motion as a geometric prior. A two-stage training strategy first learns prediction from observed ranges, and then fine-tunes the denoiser with NLOS-weighted supervision. The method is evaluated on rare real-world V2I UWB field data collected with a CAV, as well as on controlled large-scale simulation benchmarks for ablative insights. Results show that the proposed method substantially improves range accuracy, cone localization, and work zone geometry reconstruction in challenging NLOS-dominated regimes, remains robust to anchor re-indexing and moderate anchor dropout, and reduces measurement-weighted field MSE by 66.9% relative to the raw input.
[AI-281] PEACE: A Planner-Executor Agent with Constraint Enforcement for UAVs ICRA2026
链接: https://arxiv.org/abs/2606.00104
作者: Erdem Uysal,Timo Kehrer,Sebastiano Panichella
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction
Abstract:Foundation models are increasingly used to drive autonomous systems, yet existing approaches either keep the model in a tight control loop, raising latency and hallucination risk, or compile natural language into opaque end-to-end policies that are hard to explain, constraint and require domain-specific datasets and fine-tuning. We propose a planner-executor agent for PX4-based drones that decouples high-level mission planning from low-level control. A large language model performs single-pass task planning, while execution is handled through a structured ROS 2 tool-calling interface bridged to MAVLink. The system constructs a world model by combining modular 2D detectors (e.g., YOLO or vision-language models) with a pinhole depth projection module for 3D object localization. A constraint enforcement layer enforces altitude limits and horizontal geofencing, and bounded replanning enables recovery from execution-time action failures. We position our approach within three common design patterns for foundation-model-based robotics systems and demonstrate its feasibility in PX4 software-in-the-loop simulations in Gazebo. Results highlight improved explainability, constraint enforcement, and reduced LLM calls compared to tightly coupled LLM control. The code, dataset, videos, and other material can be found at the following link: this https URL
[AI-282] Evaluating Interactive Reasoning in Large Language Models : A Hierarchical Benchmark with Executable Games
链接: https://arxiv.org/abs/2606.00103
作者: Mingyuan Fan,Weiguang Han,Daixin Wang,Cen Chen,Zhiqiang Zhang,Jun Zhou
类目: Artificial Intelligence (cs.AI)
备注: preprint version, under review
Abstract:We introduce a multi-turn interactive framework for reasoning evaluation that treats reasoning as active evidence acquisition and belief updating. Wherein, LLMs receive only the task rules, must issue targeted queries to a hidden environment, integrate partial observations over time, and decide when to submit a final answer. Beyond standard success rate and interaction efficiency, we evaluate contextual robustness under controlled contextual perturbations, and metacognitive adaptation through counterfactual revision and necessity judgment. We instantiate the framework as a benchmark of 474 executable games, each evaluated under five fixed configuration search spaces corresponding to five difficulty levels, and evaluate a broad set of frontier LLMs. Results show that the benchmark is highly discriminative, exposing large differences not only in success rate but also in interaction efficiency. Moreover, we empirically show that contextual perturbations cause moderate but consistent declines, whereas counterfactual revision and necessity judgment lead to much larger drops.
[AI-283] On the evolution of the concept of probability as a mirror of the evolution of reason
链接: https://arxiv.org/abs/2606.00102
作者: Jean-Louis Le Mouël,Vincent Courtillot,Dominique Gibert,Vladimir Kossobokov,Jean-Baptiste Boulé,Pierpaolo Zuddas,Fernando Lopes,Païkan Marccagi,Alexis Maineult
类目: Artificial Intelligence (cs.AI); Probability (math.PR)
备注: 44 pages, 7 figures
Abstract:Over the centuries, probability theory has grown from the calculus of games of chance into a central framework for reasoning under uncertainty. This article interprets that evolution not merely as a mathematical history, but as a transformation of rationality itself. From Pascal and Fermat’s combinatorial symmetry to the inductive logic of Bayes and Laplace, from Poisson’s statistics of events to Kolmogorov’s axiomatic formalization, probability progressively incorporated uncertainty, time, and coherence into scientific judgment. This trajectory reaches a mature epistemological form in modern Bayesian inference, especially in Tarantola’s view of probability as a logic of information, where prior knowledge and data are combined coherently. Yet this framework also exposes a limit: probability quantifies uncertainty about well-defined propositions, but does not by itself formalize the vagueness of the concepts used to describe them. The article therefore examines how rationality extends beyond probability. Fuzzy logic is presented as a rigorous language for graded meaning and qualitative judgment, while deep learning is analyzed as a distinct, powerful mode of prediction based on geometric interpolation and optimization rather than explicit inference. By situating probability, fuzzy logic, and deep learning in a common historical and epistemological perspective, the article clarifies their roles and limits. It argues that contemporary scientific rationality cannot be reduced to data-driven performance alone, but requires the explicit articulation of uncertainty, vagueness, and inference.
[AI-284] Silent Failures in Physical AI: A Literature Review of Runtime Action Authorization for Autonomous Systems
链接: https://arxiv.org/abs/2606.00090
作者: Barak Or
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 23 pages
Abstract:Physical AI systems increasingly map multimodal observations, language instructions, and learned world representations into physically consequential actions. Robotics foundation models, vision-language-action models, and world-model-based autonomous systems can condition decisions that move vehicles, robots, drones, and industrial machines. This transition exposes a safety problem that is not fully captured by conventional AI content moderation or by classical robot safety alone: a black-box model may issue a physically consequential action while appearing confident, plausible, and semantically aligned. The resulting failure can be silent, arising from sensor drift, occlusion, state-estimation error, distribution shift, hallucinated affordances, or invalid physical assumptions before downstream hardware controllers detect a violation. Across embodied foundation models, world models, robotics simulation, embodied safety benchmarks, safe control, runtime assurance, uncertainty estimation, verification, and guardrail evaluation, model capability and safety mechanisms have advanced along largely separate technical tracks. A recurring gap synthesized here is that no single stream surveyed in this review supplies a complete runtime authorization boundary between black-box Physical AI models and physical execution. The resulting analysis develops a bounded problem formulation, a definition of silent physical-action failure, a taxonomy of runtime guardrail functions, and evaluation requirements for comparing guardrails as Physical AI assurance mechanisms. Comments: 23 pages Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00090 [cs.RO] (or arXiv:2606.00090v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2606.00090 Focus to learn more arXiv-issued DOI via DataCite
[AI-285] Can Predicted Dynamics Exist in the Physical World?
链接: https://arxiv.org/abs/2606.00089
作者: Barak Or
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:Predictive Physical AI systems output state rollouts, action chunks, and latent plans, yet a low root-mean-square error (RMSE) does not imply that a particular proposal is physically executable. We formulate physical admissibility as a prediction-control interface: before execution, a decoded proposal is treated as candidate dynamics and evaluated using kinematic, dynamic, and direct-to-composed horizon conditions. Passing is not a certificate of task success; rejection identifies violation of the specified physical envelope and gives a component-level reason. On Hugging Face LeRobot PushT, controlled falsification shows that one-step prediction-RMSE and standardized dynamics residuals reach area under the receiver operating characteristic curve (AUC) 0.982 and 0.972, kinematic-only conditions reach AUC 0.592, and the full gate reaches AUC 0.957 with condition-level attribution. In replay-based intervention experiments, residual-based filters and the full physical-admissibility gate prevent 87- 89% of invalid proposals while preserving mean progress near 0.998.
[AI-286] From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
链接: https://arxiv.org/abs/2606.00083
作者: Christian Gumbsch,Leonardo Barcellona,Lennard Schünemann,Platon Karageorgis,Andrii Zadaianchuk,Zehao Wang,Sergey Zakharov,Fabien Despinoy,Rahaf Aljundi,Efstratios Gavves
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior policy training. We propose Demo2Reward a test-time adaptation technique to optimize the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.
[AI-287] Hoeffding Concept Bottleneck Models with Applications to Overhead Images
链接: https://arxiv.org/abs/2606.00082
作者: Clément Bénard,Manon Arfib,Christophe Labreuche,Victor Quétu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate predictions for classification problems, based on a bottleneck of high-level concepts. Existing CBM methods rely on a linear aggregation of the concept scores to compute predictions. However, a large number of concepts is often used in this linear approach, which undermines explainability and favors information leakage. In general, the underlying relation between concepts and output logits is not linear. Therefore, we introduce Hoeffding Concept Bottleneck Models (HCBM), which build on the Hoeffding functional decomposition of gradient-boosted trees to provide non-linear and sparse aggregations of concept scores, and generate compact predictions using prime implicants. HCBM are proved to be robust to interconcept leakage, and outperform standard linear CBM in practice, as shown in extensive experiments. Beyond classification, HCBM can be adapted to object detection, and we focus on a challenging case with overhead images to show the high performance of HCBM in these settings.
[AI-288] DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions
链接: https://arxiv.org/abs/2606.00081
作者: Michel Dione(CERI SN - IMT Nord Europe),Jerry Lonlac(CERI SN - IMT Nord Europe),Hélène Louis(CERI SN - IMT Nord Europe),Anthony Fleury(CERI SN - IMT Nord Europe),Stephane Lecoeuche
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches-CNNs, recurrent models, and Transformer variants-either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open \Phi -OTDR benchmark and a real-scenario DAS dataset show that DAS-tatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at this https URL
[AI-289] BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization
链接: https://arxiv.org/abs/2606.00079
作者: Jiayu Zhao,Zihan Teng,Minhao Fan,Tianrui Ma,Wentao Ren,Song Chen,Weichen Liu
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 29 pages, 6 figures, 9 tables. Code and models are available at this https URL
Abstract:Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3 \times , improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76 \times over GPTQ. Our model and code are publicly available at this https URL.
[AI-290] Rare Events Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks
链接: https://arxiv.org/abs/2606.00073
作者: Aditi Aravind,Konstantinos Ladakis,Mario Alexios Savaglio,Stelios M. Smirnakis,Maria Papadopouli
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We investigate how internal representations emerge across hierarchical processing systems by introducing a neuroscience-inspired framework for analyzing deep spiking neural networks (SNN) through the lens of functional connectivity. Drawing on concepts from systems neuroscience and information theory, we form the first-order functionally-connected (1FC) group of a neuron based on its statistically significant pairwise correlations with neurons from the previous layer of a trained SNN architecture. We then track its response properties during inference under various conditions. Our analysis shows that several principles of functional connectivity previously observed in biological cortex are preserved in spiking ResNet architectures. These 1FC ensembles display interesting properties: their aggregate cofiring reliably predicts downstream neuronal responses through a robust, ReLU-like input-output relationship, whose gain scales systematically with ensemble size. Reliable encoding of the presented class emerges only during high 1FC cofiring events, which themselves occur infrequently, indicating that informative representations are concentrated in rare but highly coordinated activity patterns. Under uniform random noise or adversarial perturbations, these response profiles are disrupted, particularly in early and intermediate layers. This enables a targeted high-resolution interrogation at specific nodes and pathways. We showed that the functional connectivity structure is shaped by learning and this structure breaks under weight permutation. These establish 1FC ensembles as a functionally meaningful substrate for input encoding and information transfer, with potential implications in designing targeted fine-grained diagnostics on the information flow.
[AI-291] Physics-Informed Neural Networks for Radial Consolidation of Combined Electroosmotic Vacuum and Surcharge Preloading Considering Smear Effects
链接: https://arxiv.org/abs/2606.00056
作者: Dong Li,Yapeng Cao,Shuai Huang,Yujun Cui,Haiping Fu,Lu Yang,He Wei
类目: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applied Physics (physics.app-ph)
备注:
Abstract:This study develops a dimensionless multi-domain physics-informed neural network (PINN) framework for electro-osmotic radial consolidation considering smear effects and combined vacuum and surcharge loading. Three PINN-based models are investigated: a standard soft-constrained PINN (Std-PINN), a modified gated PINN (Mod-PINN), and a modified gated PINN with hard-constraint boundary encoding (Mod-HC-PINN). The models are evaluated against FEM reference solutions under four loading cases, including constant vacuum, exponential vacuum, exponential vacuum with ramp surcharge, and exponential vacuum with cyclic haversine surcharge. The results indicate that the gated architecture applied in Mod-PINN improves the resolution of steep pressure gradients near the cathode and smear-zone interface under constant vacuum loading. Under time-dependent loading, the soft-constrained Mod-PINN shows reduced accuracy because it must learn multiple competing objectives simultaneously. The Mod-HC-PINN mitigates this issue by embedding the cathode boundary and initial conditions into the output structure, thereby reducing the optimization burden and improving physical consistency. The Mod-HC-PINN achieves MAE values of 0.43, 0.41, and 0.27 kPa for the exponential vacuum, ramp surcharge, and cyclic surcharge cases, respectively. Sensitivity analyses further demonstrate that the proposed framework remains robust across practical ranges of network architecture, collocation density, and permeability contrast.
[AI-292] Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems
链接: https://arxiv.org/abs/2606.00052
作者: MD Shafikul Islam,Jordan Carden
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical for ensuring process safety and security. Current data-driven approaches typically employ “product-agnostic” or global models trained on the aggregate of all normal operating data. However, modern industrial facilities frequently operate under diverse product grades. While computationally simple, these global models inherently expand their decision boundaries to accommodate the variance of multiple modes, creating a “blind spot” where subtle anomalies or targeted cyber-physical attacks may be masked by the wide acceptance region of the model. In this work, we first demonstrate that the vulnerability described above is present in global-agnostic models operating across multiple product grades. We then present a Product-Aware Autoencoder as a principled mitigation that restricts the learning domain to grade-specific distributions. While this approach reduces the identified blind-spot risk, we do not claim it as the optimal mitigation among all possible alternatives. We rigorously validate this approach against a Global Agnostic baseline using the Extended Tennessee Eastman Process (TEP) benchmark. Our empirical results indicate that the Product-Aware framework performs comparably to the global baseline on standard detection metrics, while offering improved robustness to product-grade-specific operating modes. Most critically, stress tests simulating our hypothetical attack scenarios reveal that while the global model fails to detect operational deviations in 77.8% of the scenarios, the product-aware system achieves 100% detection accuracy. These findings suggest that, in flexible manufacturing environments, generalized anomaly detectors can pose non-trivial security risks, motivating a shift toward mode-aware diagnostic architectures.
[AI-293] Business Utility of Large Language Models as Exploratory Data Analysis Agents
链接: https://arxiv.org/abs/2606.00051
作者: Rafał Łabędzki,Patryk Miziuła,Hubert Rutkowski,Szymon Betlewski,Cezary Depta,Szymon Janowski,Jarosław Kochanowicz,Jan Kanty Milczek
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) are increasingly used in analytical workflows, but their suitability as exploratory data analysis (EDA) agents in business settings remains uncertain. In practice, a deployable EDA agent must provide not only useful average performance but also sufficient repeatability to support trust in its outputs. We evaluate this requirement in a controlled, business-relevant benchmark built on an agent-based supply chain simulation. The task is to identify supplier-product combinations responsible for low quality and downstream sales loss by reasoning from indirect operational traces rather than from explicit labels. Fifteen model-variant configurations from eight model families were evaluated under four experimental conditions that varied data representation, prompt clarity, and signal strength, with five trajectories per condition. Outputs were scored against deterministic ground truth using the Jaccard index and assessed through a framework that combines mean score (ms), coefficient of variation (CV), exploratory cross-condition significance tests, and Business utility, a risk-adjusted metric that we propose to summarise quality and repeatability in a single operational measure. The results show that most configurations are not reliable enough for autonomous EDA use, even when their average scores appear acceptable. GPT-5.4 with extra-high reasoning effort achieved the strongest overall profile, with an experiment-averaged ms of 0.8748 and an experiment-averaged Business utility of 0.6952, while the next-best configurations lost substantially more utility after variability discounting. Our findings suggest that evaluation of EDA agents should treat average quality, repeatability, and condition sensitivity as complementary dimensions of operational trustworthiness.
[AI-294] Measuring and Mitigating Bias in Code Generated by Large Language Models
链接: https://arxiv.org/abs/2606.00049
作者: Yuxi Chen,Yutian Tang,Timothy Storer
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are widely recognised for their applications in natural language generation and are increasingly used for code generation tasks. However, concerns about bias in their generated outputs remain significant. This paper focuses on GPT-4o and Gemini, mainstream tools for code generation, and proposes a framework for evaluating bias in LLM-generated code, specifically examining the influence of protected attributes, prompts and web-search capability. We use two metrics: the code bias score (CBS) and the attribute change ratio (ACR), to quantify the prevalence of bias and the degree of influence of different attributes, respectively. In addition, we investigate four lightweight mitigation strategies: Few-Shot, Chain-of-Thought, Few-Shot Chain-of-Thought, and Multi-agent, aimed at mitigating bias in generated code. Our findings reveal that bias remains prevalent across different protected attributes and datasets even after applying mitigation strategies, highlighting the need for more effective approaches to reduce bias in AI-driven code generation systems.
[AI-295] Comprehensive AI governance requires addressing non-model gains ICML2026
链接: https://arxiv.org/abs/2606.00047
作者: Arthur Goemans,Dan Altman,Noemi Dreksler,Jonas Freund,Milan Gandhi,Zhengdong Wang,Sarah Cogan,Sebastien Krier,Demetra Brady,Lewis Ho,Allan Dafoe
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: This paper has been accepted to ICML 2026 (Position paper track): this https URL
Abstract:Frontier AI governance often centres on the model-level governance paradigm, which assumes that a model’s capability profile is primarily a function of the compute and data used during training. This position paper argues that model-level governance becomes less effective when capability progress is increasingly driven by “non-model gains”–improvements that are independent from advances in the base model. We formalise the concept of non-model gains and provide a taxonomy of three distinct vectors of capability gain: inference gain (scaling compute at test-time), systems gain (post-training enhancements such as scaffolds), and asset gain (enhancing a model with restricted assets). We demonstrate how these vectors–alongside potential future impacts from embodiment, continual learning, and AI diffusion–may undermine risk management strategies that hinge mostly on pre-deployment evaluation and mitigation. We provide an overview of governance approaches that go beyond the model level: system, entity, agent, and cloud governance. Finally, we emphasise the importance of societal resilience as a complement to these governance layers.
[AI-296] Universal Quantum Transformer
链接: https://arxiv.org/abs/2606.00045
作者: Sungyong Chung,Alireza Talebpour
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
备注:
Abstract:Classical continuous-space neural networks fundamentally struggle to lock into exact mathematical symmetries, such as modular arithmetic and non-commutative algebra. To approximate these discrete logical rules, they often rely on massive parameter scaling, resulting in stochastic instability even after delayed generalization phenomena known as grokking. Here, we introduce the Universal Quantum Transformer (UQT), a fundamentally novel, quantum-native computing architecture that uses the physical properties of multi-qubit systems as a universal inductive bias for exact mathematical and algebraic reasoning. Rather than translating classical neural mechanisms, our framework relies entirely on parameterized geometric phase embedding and SU(2) wave-interference. We demonstrate that the quantum attention circuit, operating on a highly compact 5-qubit substrate, perfectly learns two highly distinct formal classes: cyclic modular arithmetic ( \mathbbZ_11 ) and non-Abelian algebra (the S_4 permutation group). While classical attention-based networks exhibit stochastic instability at convergence, the UQT achieves mathematically exact, deterministic generalization. We refer to this phenomenon as crystallization: a step beyond the well-known phenomenon of grokking. Crucially, this framework yields massive computational and memory advantages by theoretically bypassing the quadratic bottleneck of classical self-attention, and by logarithmically compressing the required representation dimension to eliminate the massive over-parameterization inherent to classical networks. Finally, we deploy this architecture on noisy intermediate-scale quantum (NISQ) hardware, proving its viability on current IBM Quantum computers. These results establish parameterized quantum topology as a universally superior physical substrate for exact artificial intelligence.
[AI-297] Algorithmic Authority and the Clinical Standard of Care
链接: https://arxiv.org/abs/2606.00044
作者: Aizierjiang Aiersilan
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of artificial intelligence into clinical medicine creates a fundamental tension between algorithmic probabilistic reasoning and the experiential intuition of expert physicians; applying Lawrence Lessig’s \enquoteCode is Law framework, I argue that the architecture of clinical AI systems already functions as de facto medical regulation, reshaping liability and the standard of care. Reframing AI \enquotehallucination as structurally analogous to well-documented human cognitive failures such as confirmation bias and premature diagnostic closure, I show that both failure modes demand a unified governance response. I therefore propose a dialectical standard of care that treats the integrated AI-physician dyad as the singular responsible diagnostic entity, mandating the synthesis of algorithmic precision with human interpretive authority within robust data governance and patient privacy frameworks.
[AI-298] Improving Hospital Process Management through Process Mining: A Case Study on COVID-19 Clinical Pathways
链接: https://arxiv.org/abs/2606.00041
作者: Pasquale Ardimento,Mario Luca Bernardi,Marta Cimitile,Samuele Latorre
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This study analyzes COVID-19 care pathways using the COVID Data for Shared Learning dataset. We build a transparent, reproducible pipeline that transforms heterogeneous clinical tables into a process-mining-ready event log and applies discovery, declarative conformance checking, and outcome analysis. The reconstructed pathways highlight the monitoring backbone of inpatient care, variability at the Emergency department-admission interface, and outcome differences driven by age and exposure to intensive care units. These insights support triage standardization, capacity planning, and step-down coordination from intensive care units to lower-acuity wards, showing how process mining can inform evidence-based hospital governance.
[AI-299] racing GenAI Literacy: Uncovering Student-AI Interaction Patterns in Academic Writing through Epistemic Network Analysis
链接: https://arxiv.org/abs/2606.00040
作者: Angxuan Chen,Jiyou Jia
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:As Generative AI (GenAI) becomes integral to education, fostering GenAI literacy is critical. However, current assessments largely rely on self-reported scales, lacking insights into how literacy manifests in actual learning processes. This study leverages Learning Analytics (LA) to bridge this gap. We collected interaction logs from 162 university students engaged in a GenAI-assisted abstract writing task. Using Epistemic Network Analysis (ENA), we modeled and compared the questioning strategies of students with varying GenAI literacy levels. Preliminary results reveal distinct interaction signatures: high-literacy students engage in iterative refinement and strategic questioning, while low-literacy students rely on direct generation commands. This work contributes to the workshop by demonstrating how process data can characterize GenAI literacy, paving the way for data-driven literacy assessment and real-time interventions.
[AI-300] Beyond Tool Adoption: A Practical Five-Stage Developmental Continuum for AI Literacy in Higher Education
链接: https://arxiv.org/abs/2606.00038
作者: J. Paul Liu,Rachel Levy
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 4 tables, 2 figures, 1 Supplementary Table
Abstract:Artificial intelligence (AI) literacy is increasingly recognized as a foundational competency for all university graduates. Yet students’ engagement with AI tools often clusters at two problematic extremes: avoidance driven by fear, mistrust, ethical concern, or lack of access, and uncritical reliance that produces fluent output while masking misunderstanding. Existing AI literacy frameworks provide valuable competency definitions, but most offer limited guidance for diagnosing where learners begin and how they progress toward responsible, critical engagement. This paper proposes a five-stage AI Literacy Continuum – 1) Not Yet Engaged, 2) Uncritical Use, 3) Informed Use, 4) Critical Evaluation, and 5) Improvement – that describes developmental orientations toward AI use in higher education. The continuum complements dimensional frameworks by providing educators with a practical diagnostic and instructional pathway aligned with international frameworks, including UNESCO and OECD. We present a design-based implementation case from North Carolina State University, where credit-bearing courses and intensive hands-on workshops engaged more than 330 participants between Fall 2024 and Spring 2026. Because the implementation did not use a validated pre/post instrument or comparison group, we frame the findings as observational and practice-based: participants exhibited behaviors consistent with movement from non-engagement or uncritical use toward informed engagement, while sustained and discipline-embedded experiences produced stronger evidence of critical evaluation and improvement-oriented practice. We discuss curricular pathways, equity considerations, assessment strategies, and argue that AI literacy should be understood not as tool adoption alone but as a developmental capacity to understand, evaluate, and responsibly apply AI systems in disciplinary and societal contexts.
[AI-301] Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing ACL2026
链接: https://arxiv.org/abs/2606.00033
作者: Michael Lan,Narmeen Fatimah Oozeer,Chaithanya Bandi,Philip Quirke,Austin Meek,Fazl Barez,Amirali Abdullah
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted at ACL 2026 main conference
Abstract:While mechanistic interpretability (MI) has produced important insights into neural network internals, the field has yet to establish a standardized system to audit experiments. As such, many of its findings remain underutilized in safety-critical applications such as medical AI and autonomous systems, as stakeholders cannot certify their validity. Recent work demonstrates this concretely: two papers found conflicting conclusions for the same behavior, and a third study revealed that both were partially correct but incomparable due to methodological inconsistencies. Without standardized auditing, such ambiguities hinder adoption in high-stakes contexts requiring strong correctness guarantees. We call for the MI community to work towards developing a novel reviewing system that complements peer review via: (1) Continuous reviewing supported by a \emphCollaborative Reviewing Platform where meta-science results and discussions (such as critiques, negative results, post-hoc extensions, reproductions, replications, and partial results) that fit outside of papers are organized and discussed, allowing for comments and revisions to be made at any time (2) Generalizing good practices found on this platform into expert-verified guidelines and protocols to improve auditing efficiency, and (3) Source-based auditing systems that track arguments which claims depend on. This position paper encourages constructive debate over the necessity, design and implementation of such a framework, providing early concrete examples to help catalyze these dialogues. Overall, we propose that auditing MI itself is essential for its application in AI safety, industry, and governance.
[AI-302] Optimal Transport-based Permutation-Invariant Bayesian Optimization of Offshore Wind Farm Layouts
链接: https://arxiv.org/abs/2606.00009
作者: Antonio Candelieri,Laurens Bliek
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Bayesian Optimization (BO) is widely and successfully adopted for solving optimization problems having an expensive-to-evaluate, black-box, and non-convex objective function. However, the vanilla BO algorithm is not able to exploit possible symmetries characterizing the target problem. An intuitive case is given by optimal location problems, whose decision variables refer to a finite set of points within a continuous space, with the order of points not affecting the value of the objective function. We refer to this setting as optimization over layouts to distinguish from optimization over point-clouds where, instead, the order of points counts. As an instance of optimization over layouts we consider a real-life industrial-relevant application, that is the optimization of the layout of an offshore wind farm: given identical wind turbines, switching any pair of them has not any effect on the annual energy production. Based on Optimal Transport theory, we propose a Permutation-Invariant BO approach, namely PIBO, proved to provide better wind farm layouts when compared to the vanilla BO approach while cutting computation time roughly in half.
[AI-303] Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization
链接: https://arxiv.org/abs/2606.00008
作者: Jia Zhang,Tengfei Ma,Tianle Li,Daojian Zeng,Xieping Gao,Xiangxiang Zeng
类目: Artificial Intelligence (cs.AI)
备注: 17 pages, 6 figures
Abstract:Multi-objective molecular optimization requires searching vast chemical spaces under conflicting objectives, where early design decisions strongly constrain downstream outcomes. Existing methods typically rely on a single policy or fixed scalarization, which limits their ability to represent diverse trade-offs and to explore multiple promising design trajectories. We propose ATOM, a multi-agent framework that formulates molecular optimization as a tree-structured search. Each node corresponds to an atomic operation and hosts an agent specialized for a particular objective or decision context. Agents coordinate along different paths of the tree rather than enforcing a global consensus, enabling the method to maintain and compare alternative molecular evolution trajectories. A global memory of past optimization behaviors further supports balanced exploration and exploitation across objectives. This tree-structured interaction enables reasoning over long-horizon dependencies inherent in molecular design. Experiments on challenging multi-objective benchmarks involving activity, synthesizability, and ADMET-related properties show that ATOM consistently achieves improved Pareto coverage and hypervolume over strong baselines. These results demonstrate the effectiveness of pathwise multi-agent coordination for molecular optimization. Code is available at this https URL.
[AI-304] Deliberative Curation: A Protocol for Multi-Agent Knowledge Bases
链接: https://arxiv.org/abs/2606.00007
作者: Steven Johnson
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 1 figure, 6 tables. Open-source implementation available at this https URL
Abstract:As AI agents transition from isolated tools to collaborative participants in shared knowledge ecosystems, governing collective knowledge curation becomes a critical challenge. Human platform governance mechanisms do not transfer directly: agent statelessness undermines deterrence-based sanctions, model homogeneity violates independence assumptions underlying crowd wisdom, and sycophancy collapses deliberative consensus. We propose a deliberative curation protocol combining three governance layers: (1) a knowledge artifact lifecycle formalized as a labeled transition system; (2) reputation-weighted deliberative voting integrating Beta Reputation with EigenTrust amplification; and (3) graduated sanctions adapted for stateless agents, including broken agent handling distinguishing malfunction from adversarial behavior. We evaluate the protocol through agent-based simulation with 100 agents across seven behavioral archetypes under two adversity scenarios (30 seeds, paired t-tests). The protocol trades modest precision under benign conditions for substantially better resilience under adversity: 0.826 vs 0.791 for majority vote under moderate adversity (p0.001), widening to 0.807 vs 0.740 under stress (p0.001). The protocol degrades roughly three times more slowly than majority vote. Ablation analysis identifies commit-reveal vote concealment as the most impactful single component (8.2-8.6pp precision improvement, p0.001), outperforming reputation weighting and deliberation combined. Graduated sanctions were not exercised in simulation and remain empirically unvalidated. Comments: 29 pages, 1 figure, 6 tables. Open-source implementation available at this https URL Subjects: Artificial Intelligence (cs.AI) ACMclasses: I.2.11; H.3.4; K.4.3 Cite as: arXiv:2606.00007 [cs.AI] (or arXiv:2606.00007v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.00007 Focus to learn more arXiv-issued DOI via DataCite
[AI-305] Emergent Collaborative Deliberation in Multi-Model AI Systems: A BFT-Derived Protocol for Epistemic Synthesis
链接: https://arxiv.org/abs/2606.00005
作者: VD Doske
类目: Artificial Intelligence (cs.AI)
备注: 32 pages, 7 figures
Abstract:We present the Consilium Protocol, a Byzantine Fault Tolerance-derived architecture for structured multi-model AI deliberation that treats inter-model disagreement as epistemic signal rather than error. The protocol assigns engineered cognitive personas to language models – separating what a model is from how it reasons – and introduces an In-Sample/Out-of-Sample validation framework adapted from quantitative finance to distinguish training-data consensus from empirically grounded conclusions. Across 1,478 deliberation sessions spanning 32 topics in 10 domain categories, we demonstrate that (1) the cognitive persona, not the underlying model, determines epistemic behavior: free edge-inference models costing 0.0002 USD per batch produced comparable analytical output to frontier models costing 10.69 USD; (2) RLHF alignment training creates measurable, domain-specific epistemic blind spots – contested policy topics exhibit 12.3 percentage points less adversarial challenge than settled science topics, and AI safety topics show asymmetric bias ( \Delta =11.6%) where models challenge claims that AI is dangerous far more vigorously than claims that AI risk is overstated; (3) the protocol exhibits no directional bias of its own (immigration \Delta =2.3%, renewables \Delta =1.2%); and (4) out-of-sample evidence retrieval validated 239 claims with 100% evidence retrieval and surfaced 167 blind-spot discoveries invisible to training-data deliberation. Run-to-run reproducibility across randomized model \times persona assignments averages \pm 2.2% standard deviation. Total cost for the complete battery including all overhead: 217 USD. We release the protocol specification under MIT license to enable independent verification.
[AI-306] Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations
链接: https://arxiv.org/abs/2606.00002
作者: Yi-Xiang Hu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mixed-Integer Linear Programming (MILP) decision engines routinely output nominally optimal plans for high-stakes industrial systems. Yet deployment rarely matches solve-time assumptions: small perturbations in costs, demands, or resource availability can invalidate feasibility or trigger discontinuous shifts to qualitatively different solutions. We argue that this post-solve robustness gap is a missing layer in today’s optimization pipelines and a missing evaluation dimension for learning-enabled decision systems. Rather than replacing robust optimization or stochastic programming, the proposed layer audits a solved incumbent and returns solver-backed evidence about how far that solution can be trusted. We formalize two central objects: (i) an \epsilon -near-optimal feasible neighborhood in parameter space, capturing when an incumbent remains feasible and near-optimal under perturbations, and (ii) solution smoothness in decision space, capturing whether nearby alternatives with small combinatorial edits remain competitive. We then synthesize the most relevant partial answers from sensitivity and stability analysis, robust optimization, neighborhood search, adversarial testing, and learning-based enhancements, and articulate an agenda for a unified post-solve robustness layer. Concretely, we call for certified inner approximations around the incumbent, probabilistic robustness estimation with calibrated uncertainty, adversarial robustness margins, and learning-based prediction and explanation aligned with solver-backed verification. We conclude with a compact reporting template and evaluation protocol that would make robustness a first-class output of decision engines.
[AI-307] A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks
链接: https://arxiv.org/abs/2507.19702
作者: Mohammed A. Ramadhan,Abdulhakeem O. Mohammed
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, existing approaches often face trade-offs between accuracy and computational efficiency. To address these challenges, we propose 1D-CGS, a lightweight and effective hybrid model that integrates the speed of one-dimensional convolutional neural networks (1D-CNN) with the topological representation power of GraphSAGE for efficient node ranking. The model uses a lightweight input representation built on two straightforward and significant topological features: node degree and average neighbor degree. These features are processed through 1D convolutions to extract local patterns, followed by GraphSAGE layers to aggregate neighborhood information. We formulate the node ranking task as a regression problem and use the Susceptible-Infected-Recovered (SIR) model to generate ground truth influence scores. 1D-CGS is initially trained on synthetic networks generated by the Barabasi-Albert model and then applied to real world networks for identifying influential nodes. Experimental evaluations on twelve real world networks demonstrate that 1D-CGS significantly outperforms traditional centrality measures and recent deep learning models in ranking accuracy, while operating in very fast runtime. The proposed model achieves an average improvement of 4.73% in Kendall’s Tau correlation and 7.67% in Jaccard Similarity over the best performing deep learning baselines. It also achieves an average Monotonicity Index (MI) score 0.99 and produces near perfect rank distributions, indicating highly unique and discriminative rankings. Furthermore, all experiments confirm that 1D-CGS operates in a highly reasonable time, running significantly faster than existing deep learning methods, making it suitable for large scale applications.
[AI-308] Evolutionary Discovery of Bivariate Bicycle Codes with LLM -Guided Search
链接: https://arxiv.org/abs/2606.02418
作者: Juan Cruz-Benito,Andrew W. Cross,David Kremer,Ismael Faro
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Quantum LDPC code discovery requires searching large algebraic design spaces while reliably certifying the parameters and equivalence classes of any candidates found. We introduce an LLM-guided evolutionary workflow in which language models mutate Python programs that generate bivariate-bicycle and perturbed bivariate-bicycle code ansätze. Across five campaigns, the system performed approximately 1,650 evolutionary iterations, screened about 2 \times 10^5 candidate codes, and required \sim140 hours of computation and \sim US\ 400 in LLM inference cost. Candidate codes are evaluated through a staged validation pipeline combining \mathrmGF(2) rank computation, distance estimation and certification, mixed-integer linear programming, BLISS Tanner-graph deduplication, decomposability analysis, and local-Clifford equivalence checks. At block length n \leq 360 , the workflow identifies 465 distinct candidate codes: 97 CSS bivariate-bicycle codes and 368 non-CSS perturbed variants. The CSS search recovers known high-performing codes and finds new finite-length representatives, including an indecomposable [[288,16,12]] code and higher-weight codes with up to k = 50 at distance d = 8 . The non-CSS search produces perturbed codes matching the gross-code figure of merit at [[144,12,12]], along with additional high-distance candidates reported as certified values or upper bounds according to MILP status. Overall, these results show that LLM-guided program evolution can serve as a practical tool for structured quantum-code discovery when paired with independent evaluation.
[AI-309] RA-LWLM: Retrieval-Augmented In-Context Localization with Wireless Foundation Models
链接: https://arxiv.org/abs/2606.01899
作者: Guangjin Pan,Hui Chen,Hei Victor Cheng,Henk Wymeersch
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: 13 pages, 9 figures. This work has been submitted to the IEEE for possible publication
Abstract:Wireless localization is a fundamental capability of sixth-generation (6G) networks. Conventional model-based methods require accurate modeling of the propagation environment and degrade in complex multipath and non-line-of-sight scenarios, while learning-based methods couple model parameters tightly to the training scene, requiring costly retraining whenever the base station (BS) configuration or propagation environment changes. In this paper, we propose RA-LWLM, a retrieval-augmented in-context localization framework that achieves training-free cross-scene adaptation by externalizing scene-specific information into a per-scene fingerprint database rather than encoding it in model weights. The framework consists of three components: a frozen wireless foundation model (FM) encoder that maps raw channel state information into a scene-agnostic representation; a retrieval module that selects the most informative references from the per-scene database via similarity search in the representation space; and a transformer-based in-context learning (ICL) module that fuses the query with the retrieved references to predict the user equipment (UE) position. To accommodate varying retrieval quality and propagation complexity across queries, the ICL module adopts a mixture-of-experts design in which experts specialize in different context sizes and are softly combined by a learnable selector. Extensive ray-tracing-based experiments across heterogeneous scenes with diverse BS configurations show that RA-LWLM achieves nearly identical accuracy on seen and unseen scenes without any per-scene retraining, substantially outperforming end-to-end and FM-based baselines. These results validate the proposed retrieval-augmented in-context paradigm as a scalable solution for cross-scene localization in 6G networks.
[AI-310] MINTS: Minimalist Thompson Sampling
链接: https://arxiv.org/abs/2606.01655
作者: Kaizheng Wang
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: 29 pages
Abstract:The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the location of the optimum, while eliminating nuisance parameters through profile likelihood. This yields a generalized posterior that naturally accommodates structural constraints. As a direct instantiation, we develop MINimalist Thompson Sampling (MINTS). For multi-armed bandits with mean constraints, we establish near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations. In particular, MINTS attains the classical Lai–Robbins constant in the unstructured setting and automatically adapts to unimodal structure, achieving the sharp constant determined only by the immediate neighbors of the optimal arm.
[AI-311] Demystifying Multimodal Biomolecular Co-design With Intrinsic Geodesic Coupling ICML2026
链接: https://arxiv.org/abs/2606.01628
作者: Keyue Qiu,Xintong Wang,Zhilong Zhang,Hao Zhou,Wei-Ying Ma
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026
Abstract:Biomolecules such as proteins and small-molecule ligands play a central role in biological systems, arising from the tight interplay between sequence and three-dimensional structure. Recent generative models for biomolecular co-design aim to capture this interplay by jointly modeling coupled modalities. However, existing approaches largely adopt a parallel execution of marginal generative processes, implicitly enforcing fixed synchronous coupling. We argue that a critical but overlooked degree of freedom lies in how these marginal processes are temporally coupled during training and generation, where inappropriate coupling can introduce high-variance supervision and inconsistent intermediate states, affecting modality consistency. To address this, we introduce GeoCoupling, a systematic framework that optimizes for temporal couplings between heterogeneous modalities. Empirical results across structure-based drug design and unconditional protein design demonstrate the learned couplings consistently outperform synchronous and randomly coupled baselines, yielding biomolecules with improved physical validity and diversity.
[AI-312] Emergent Transfer of a Physics Foundation Model from Simulation to Laboratory Turbulence
链接: https://arxiv.org/abs/2606.01470
作者: Payel Mukhopadhyay,Stefan S. Nixon,Romain Watteaux,Michael McCabe,Alberto Bietti,Kyunghyun Cho,Cristiana Diaconu,Irina Espejo Morales,David Fouhey,Siavash Golkar,Tom Hehir,Shirley Ho,Jake Kovalic,Geraud Krawezik,Francois Lanusse,Tanya Marwah,Rudy Morel,Mariel Pettee,Helen Qu,Jeff Shen,Hadi Sotoudeh,Stuart B. Dalziel,Miles Cranmer
类目: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Whether physics foundation models can be usefully deployed on laboratory experiments remains an open question for scientific machine learning (ML). We test this question on the Rayleigh-Taylor instability (RTI), a ubiquitous and demanding fluid instability seen from tabletop flows to supernova explosions, in which small perturbations at a density interface grow into chaotic, multiscale mixing as a lighter fluid accelerates into a heavier one. Standard ML models struggle with RTI, and despite over a century of theoretical, numerical, and experimental work, it carries an unresolved discrepancy between simulation and experiment: the late-time mixing growth rate, \alpha , measured in most laboratory experiments ( \sim 0.06-0.07), is roughly three times the value from idealized direct numerical simulations (DNS, \sim 0.02). The gap’s origin remains debated. These properties make RTI a stringent test for a question that matters well beyond RTI: can foundation models trained only on simulations generalise to sparse, messy, and noisy laboratory settings? We finetune Walrus, a foundation model for continuum dynamics, on three or fewer DNS realizations and recover key RTI physics over long rollouts. Applied zero-shot to sliding-barrier laboratory data, the finetuned model leaves the DNS-like regime and enters the observed growth band, having never seen a single experimental sample. These results provide independent, data-driven evidence that initial conditions play a crucial role in the longstanding sim-experiment gap in \alpha . The model also generalises zero-shot to stable stratification, a buoyancy regime absent from training, correctly slowing mixing-layer growth. Together, our results show that foundation models can generalise well beyond their training data, predicting laboratory behavior and unseen physical regimes, opening new ways to probe longstanding simulation-experiment gaps.
[AI-313] Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics
链接: https://arxiv.org/abs/2606.01468
作者: JR Huml,Jonathan Wenger,John P. Cunningham
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, Proceedings of 2nd International Conference on Probabilistic Numerics (2026)
Abstract:Due to their explicit priors and ability to model uncertainty, Bayesian methods have played a major role in dynamical latent variable modeling of single-cell neural recordings. However, modern-sized datasets have made overparameterized deep networks the preferred methods of choice due to their predictive power and favorable computational scaling. While many posterior approximations exist, all incur approximation errors. Recent work accounts for this error in the form of computational uncertainty but comes at the cost of quadratic complexity and assumes fixed model hyperparameters. Here we extend this development to model selection, including a novel training loss and optimization scheme, which yields tractable inference in large state-spaces. We introduce a framework, the Computation-Aware State-Space Model (CASSM), specifically designed for the scale-imbalanced regime, where the number of trials is significantly lower than the number of recorded neurons. In this regime, for both synthetic and real data, we show that our method is competitive with data-hungry deep networks, with significantly improved uncertainty calibration over previous attempts to scale Bayesian methods. Our experiments provide a roadmap to neuroscience researchers in choosing from a host of potential dynamical latent variable models given key dataset properties and constraints.
[AI-314] A Communication-Centric 6G-LLM Architecture for Scalable Tactical Autonomous Defense Vehicle Networks
链接: https://arxiv.org/abs/2606.01312
作者: Kiran Khurshid,Shumaila Javaid,Nasir Saeed
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 10 pages, accepted in IEEE Network Magazine
Abstract:The integration of Artificial Intelligence (AI) and emerging 6G networks introduces new opportunities for scalable coordination in tactical autonomous vehicle systems. This paper proposes a communication-centric hierarchical architecture for Tactical Autonomous Defense Vehicle Networks (TADVNs) that models the integration of edge-assisted Large Language Model (LLM) reasoning with 6G-enabled connectivity and semantic communication. The framework is designed to improve coordination efficiency, reduce communication overhead, and enhance latency resilience under increasing fleet-scale operation. Unlike conventional task-specific AI pipelines that rely on structured feature processing and rule-based coordination, the proposed approach incorporates semantic abstraction and context-aware decision support within a layered edge-cloud communication architecture. We evaluate communication and coordination performance via Monte Carlo simulations across fleet sizes of 5-30 vehicles under contested network conditions. Results indicate that at a 30-vehicle scale, the 6G-LLM configuration achieves 75.2% latency reduction (29.1 ms vs. 117.5 ms), a 68.7 percentage point increase in mission success rate (82.9% vs. 14.2%), and an 88.6% reduction in communication overhead compared to a 5G-based conventional AI baseline. These findings demonstrate measurable benefits in coordination and communication when semantic reasoning is combined with low-latency 6G connectivity.
[AI-315] Quantum Algorithm for Distributed Reduction of Entanglements (QADR): A Trainable and Simulation-Efficient QML Framework
链接: https://arxiv.org/abs/2606.01291
作者: Syed Farhan Ahmad,Gregory T. Byrd
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:Training Variational Quantum Circuits (VQCs) under Noisy Intermediate-Scale Quantum (NISQ) constraints introduces severe computational limitations: classical statevector simulation memory scales exponentially ( \mathcalO(2^n) ), and global cost functions suffer from barren plateaus where gradient variance decays exponentially ( \mathcalO(1/2^n) ). This paper introduces and evaluates the Quantum Algorithm for Distributed Reduction of Entanglements (QADR), a hybrid quantum-classical machine learning framework that decomposes a global n -qubit VQC into localized sub-circuits operating approximately within the causal light cones of individual target qubits. QADR reduces classical simulation memory scaling from \mathcalO(2^n) to \mathcalO(n \cdot 2^2d+1) for a light cone radius d , while naturally mitigating global barren plateaus. We benchmark QADR against standard global VQCs, Support Vector Machines (SVM), and two customized classical parameter-matched neural networks (CANN and PMNN) on the MNIST dataset and the high-dimensional NASA IMS wind turbine drivetrain diagnostic task. QADR demonstrates excellent scalability, operating successfully at n_\textfeatures=2000 where standard global VQCs crash due to memory exhaustion, while matching or exceeding the performance of optimized classical architectures.
[AI-316] opological Ignorability for Structural Causal Effects Beyond Means
链接: https://arxiv.org/abs/2606.01184
作者: Usef Faghihi
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI)
备注:
Abstract:Many interventions alter the structure of an outcome distribution rather than its mean: they can split a population into disconnected regimes, create loops or holes, generate branches, or reorganize an outcome cloud while leaving the average response nearly unchanged. In such settings, mean-based causal estimands such as the average treatment effect may miss important structural effects. We introduce topological-geometrical causal metrics based on summaries of interventional outcome laws, including density-superlevel Betti summaries, Euler signatures, and persistent-homology summaries. These metrics quantify structural differences between treated and untreated outcome laws beyond averages. We also study the assumptions needed for causal interpretation. We introduce topological ignorability, a topological analogue of conditional ignorability that requires invariance of the chosen structural feature rather than the full counterfactual distribution. When the chosen summary is injective, this condition coincides with weak ignorability; for noninjective summaries, it can identify the structural feature of interest without identifying the full interventional law. We define a covariate-standardized topological-geometrical causal effect and develop practical estimators. We validate the framework in two hidden-confounding benchmarks: a fully synthetic exact benchmark and a real-covariate semi-synthetic benchmark using Wisconsin breast-cancer covariates. In both, weak ignorability fails and balancing observed covariates nearly eliminates standardized mean differences, yet the coordinate-mean average treatment effect remains biased. By contrast, selected finite density-superlevel Betti and Euler contrasts remain stable across oracle, observational, and weighted analyses. Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.01184 [stat.ME] (or arXiv:2606.01184v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2606.01184 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-317] Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing
链接: https://arxiv.org/abs/2606.00834
作者: T. Ansah-Narh,Y. Asare Afrane,J. Bremang Tandoh
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
备注: 24 pages, 8 figures, accepted for publication in Artificial Intelligence in Medicine
Abstract:Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved R^2 = 0.9906 versus 0.8213 for Holt-Winters alone, with 94.2% of residuals within \pm 2\sigma bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8,000 to 12,200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana’s national malaria control strategy.
[AI-318] Certificates without Electrons? Theory and Evidence on Impacts from AI-Driven Power Demand
链接: https://arxiv.org/abs/2606.00811
作者: Dana Golden,Aruna Balasubramanian,Niranjan Balasubramanian
类目: Econometrics (econ.EM); Artificial Intelligence (cs.AI)
备注:
Abstract:Data centers now account for 4.4% of United States electricity demand, yet the grid-level effectiveness of the renewable energy certificates (RECs) and power purchase agreements (PPAs) hyperscalers use to claim carbon neutrality remains unclear. We develop a game-theoretic model in which a data center operator chooses among RECs, PPAs, and behind-the-meter colocation while generators make entry decisions under endogenous financing costs. The model identifies a timing wedge – the mismatch between consumption and credited renewable generation – as a central mechanism through which AI demand degrades reliability, raises prices, and increases emissions even when RECs cover 100% of annual consumption. Colocation with storage addresses this wedge directly and induces the greatest renewable entry by eliminating generator revenue risk. We test these predictions by exploiting the staggered release of large language models as a natural experiment, using difference-in-differences on a novel dataset linking AI activity to local grid outcomes. AI demand significantly increases fossil generation, wholesale prices (up to 25% in treated PJM zones), and outage frequency (0.5–1 additional outages per year) near data centers, with impacts scaling in model size. Data centers with on-site generation exhibit a sign reversal in power-quality effects, consistent with the model’s prediction that behind-the-meter capacity absorbs demand spikes. Counterfactual analyses show that edge inference, spatial reallocation, and colocated storage each substantially mitigate grid impacts, while REC-only strategies do not. Together, our results demonstrate that the externalities of AI to the grid are tightly coupled to procurement design and the spatial organization of data center infrastructure.
[AI-319] Bayesian Inference of Nonlinear Malaria Dynamics in Ghana via an Ensemble Markov Chain Monte Carlo Sampler
链接: https://arxiv.org/abs/2606.00783
作者: T. Ansah-Narh,Y. Asare Afrane,J. Bremang Tandoh
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Probability (math.PR); Computation (stat.CO)
备注: 27 pages, 15 figures, published in Expert Systems with Applications
Abstract:Reliable quantification of malaria dynamics in sub-Saharan Africa is hindered by short, noisy, and spatially heterogeneous surveillance records. In Ghana, health-facility data from 2014 to 2023 reveal non-linear and age-specific fluctuations in hospital admissions, yet existing approaches struggle to capture stochastic variability or provide credible uncertainty bounds. This study develops a Bayesian nonlinear inference framework that integrates a cubic baseline with a damped oscillatory kernel, estimated via an affine-invariant ensemble Markov Chain Monte Carlo sampler. The framework accommodates limited data, models parameter uncertainty, and generates probabilistic forecasts for children under five years and individuals aged five years or more. Results show strong empirical adequacy ( R^2 = 0.9958 for 5 years; R^2 = 0.9956 for \geq 5 years) with residual errors below 2% and well-mixed posteriors confirming convergence. District-level analysis reveals pronounced spatial heterogeneity, with coefficients of variation ranging from 0.07 in urban centres such as Kumasi to 3.3 in peripheral districts such as Mpohor and Bia East. Forecasts for 2024-2026 indicate a gradual resurgence: from 137,000 to 149,000 cases among children under five years and from 348,000 to 375,000 cases among older individuals, with uncertainty widening over time. By producing probabilistic forecasts, this Bayesian framework provides a principled tool for anticipating malaria fluctuations and strengthening data-driven decision-making in Ghana’s national malaria control strategy.
[AI-320] Causal Density Functions
链接: https://arxiv.org/abs/2606.00754
作者: Sridhar Mahadevan
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 25 pages
Abstract:We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act as local density ratios for causal effects. Whereas many causal-strength measures compare whole distributions after graph surgery, causal density functions provide a pointwise change-of-measure object that can be estimated, calibrated, and used to score directed influence. The basic identity [ \mathbbE_\mathrmdo[f(Y)] = \mathbbE_\mathrmobs!\left[f(Y)\rho(X,Y)\right] ] makes causal density directly testable: if the estimated density ratio is correct, observational expectations reweighted by \rho reproduce interventional expectations. We derive practical estimators for do-curves and directed edge scores, relate the construction to Radon-Nikodym/Kan semantics for conditioning and intervention, and evaluate the resulting estimators on synthetic and real perturbation benchmarks. Comments: 25 pages Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2606.00754 [stat.ME] (or arXiv:2606.00754v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2606.00754 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Sridhar Mahadevan [view email] [v1] Sat, 30 May 2026 14:41:25 UTC (814 KB) Full-text links: Access Paper: View a PDF of the paper titled Causal Density Functions, by Sridhar MahadevanView PDFHTML (experimental)TeX Source view license Current browse context: stat.ME prev | next new | recent | 2026-06 Change to browse by: cs cs.AI cs.LG stat References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status
[AI-321] A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering
链接: https://arxiv.org/abs/2606.00402
作者: Yi Liu
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Applications (stat.AP)
备注:
Abstract:We propose a distribution-free statistical framework that converts arbitrary rewrite-based detectors into detectors with finite-sample FDR guarantees without retraining. Our key observation is that rewrite-based detection implicitly constructs knockoff samples, enabling LLM-generated text detection to be formulated as a multiple hypothesis testing problem with knockoff structure. This perspective separates the design of detection statistics from the control of false discoveries, allowing existing rewrite detectors to inherit finite-sample false discovery rate (FDR) guarantees through a simple calibration procedure. We demonstrate reliable FDR control with meaningful detection power across three detection models, 19 domains, and four LLMs.
[AI-322] Interpreting FCDNNs via RG on Exponential Family
链接: https://arxiv.org/abs/2606.00157
作者: Fuzhou Gong,Zigeng Xia
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Probability (math.PR)
备注: 18 pages, 2 figures
Abstract:We consider establishing the interpretability theory of deep learning through constructing a corresponding relationship between the renormalization group (RG) method in statistical physics and the training process of deep neural networks (DNNs). We have proved the constructed relationship using the one-dimensional Ising model as the input data. In this paper we generalize our results to the case of continuous input data, which is a necessary preparation for applying the corresponding framework to real-world data. To be representative, we consider a class of data distribution in the exponential family. We prove that when the parameters of fully connected (FC) DNNs achieve their optimal value after training, the characteristic parameters of the feature layer output of DNNs are equal to the fixed points of the characteristic parameters of input data under RG method for continuous fields. This conclusion shows that the training process of DNNs is equivalent to RG calculation on this kind of data and therefore the network can extract main features from the input data just like RG. Also, the equivalence further validates the correspondence framework we have established, providing an explanation for the outstanding performance of DNNs on real-world data.
[AI-323] A physics-informed foundation model for quantitative diffusion MRI
链接: https://arxiv.org/abs/2606.00156
作者: Zihan Li,Jialan Zheng,Ziyu Li,Xun Yuan,Kasidit Anmahapong,Ziang Wang,Mingxuan Liu,Hongjia Yang,Yifei Chen,Zhuhao Wang,Yuhang He,Fang Chen,Rui Li,Huaiqiang Sun,Yi Liao,Congyu Liao,Yang Yang,Haibo Qu,Xue Zhang,Hongen Liao,Qiyuan Tian
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:
Abstract:Understanding the human brain requires access to its microscopic tissue architecture. Diffusion magnetic resonance imaging (MRI) provides the only noninvasive window into whole-brain microstructure in vivo, yet reliable quantitative mapping remains confined to specialized research settings requiring dense sampling and optimized acquisition protocols. To address this gap, we present a physics-informed generative microstructure network (PIGMENT) that learns a universal generative prior of human brain microstructure and adapts it zero-shot to each participant’s measured data to recover subject-specific maps. Trained on 11375 scans spanning multiple sites, vendors, and field strengths, PIGMENT enabled reliable quantitative mapping for tensor, kurtosis, and NODDI models across external datasets from five independent centers. It remains effective where conventional fitting becomes unreliable, recovering meaningful maps from extremely sparse acquisitions while supporting downstream tractography and structural connectivity mapping. PIGMENT estimates demonstrated strong biological validity, preserving submillimeter cortical microarchitectural patterns and early-childhood white matter developmental trajectories from 10-fold accelerated scans. Furthermore, PIGMENT enables reliable quantitative tensor mapping on cost-efficient low-field systems and the extraction of tumor-related biomarkers using ultra-fast clinical protocols. Together, these results establish PIGMENT as a physics-informed foundation model that extends quantitative diffusion MRI into regimes traditionally too sparse, heterogeneous, or clinically constrained for reliable analysis.
[AI-324] Regime-Adaptive Continual Learning for Portfolio Management KDD2026
链接: https://arxiv.org/abs/2606.00143
作者: Chaofan Pan,Lingfei Ren,Linbo Xiong,Yonghao Li,Wei Wei,Xin Yang
类目: Portfolio Management (q-fin.PM); Artificial Intelligence (cs.AI)
备注: Accepted by KDD 2026
Abstract:Financial markets are inherently non-stationary, exhibiting frequent regime shifts and structural changes that render traditional Portfolio Management (PM) approaches ineffective. Existing remedies, such as rolling-window retraining and naive online fine-tuning, are hindered by high computational costs and insufficient knowledge utilization, respectively, resulting in low returns and limited adaptability. Continual learning (CL) offers a promising paradigm by enabling trading agents to accumulate and transfer knowledge across sequential tasks. In this paper, we propose \textbfRegime-aware \textbfContinual \textbfAdaptive \textbfPortfolio management (\textbfReCAP), a novel framework that integrates CL into PM to address the challenges of dynamic financial environments. ReCAP employs an adaptive regime detection module to segment historical market data into variable-length regimes, enabling regime-specific learning of policy vectors and the construction of a policy library. During continual trading, a regime-gate module adaptively combines policy vectors from the library based on the current market state, facilitating rapid adaptation to newly detected regimes. Only the regime-gate and the current regime’s policy vector are continually updated to preserve useful knowledge effectively. Extensive experiments on five real-world datasets demonstrate that ReCAP consistently outperforms popular baselines, achieving superior returns in long-term investment horizons and rapid adaptation to regime shifts.
[AI-325] SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction
链接: https://arxiv.org/abs/2606.00120
作者: Liwen Jing,Yisha Lu,Tingting Yang,Li Sun,Yuxuan Shi,Yuwei Wang,Mengfan Zheng,Leiyang Xu
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neural network (ANN)-based transformers for wireless foundation models (WFMs). Inspired by the noise-robust and energy-efficient information processing in the human brain, SpikeWFM aims to enhance the resilience of WFMs against noise and interference while maintaining strong generalization capabilities across diverse wireless scenarios. Drawing from the success of large language models, WFMs leverage self-supervised pre-training on large-scale datasets spanning various wireless environments to learn a unified embedding that supports a wide range of downstream tasks, including channel prediction, channel estimation, beam predition, positioning and etc. Such models typically outperform task-specific designs and exhibit superior adaptability to unseen conditions. However, existing WFMs remain vulnerable to realistic noise and interference in practical wireless systems. To address this limitation, we incorporate spiking neurons into the transformer-based WFM architecture. We provide a brief theoretical analysis demonstrating how the SNN-ANN hybrid effectively mitigates noise and interference through temporal sparsity and event-driven processing. Experimental results show that SpikeWFM consistently outperforms conventional ANN-based WFMs in both pre-training convergence and channel prediction accuracy. Additional results on communication and sensing tasks will be presented in the full journal version of this work.
[AI-326] Project SPARROW and the Future of Conservation Technology
链接: https://arxiv.org/abs/2606.00108
作者: Juan M. Lavista Ferres,Carl Chalmers,Bruno Demuro Segundo,Zhongqi Miao,Andres Hernandez Celis,Federico Alves Torres,Isai Daniel Chacon Silva,Anthony Cintron Roman,Allen Kim,Meygha Machado,Luana Marotti,Amy Michaels,Daniela Ruiz Lopez,Catherine Romero,Rahul Dodhia,Inbal Becker-Reshef,Pablo Arbelaez
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Global biodiversity is declining at unprecedented rates, yet the tools available to monitor and protect ecosystems remain limited by constraints in power, connectivity, and accessibility. We present SPARROW, a hardware and software open-source platform that integrates solar energy, edge artificial intelligence, and satellite communication to enable continuous, autonomous biodiversity monitoring in remote environments. Each SPARROW node combines a low-power Graphics Processing Unit (GPU) with modular visual, acoustic, and environmental sensors, performing on-device deep learning inference and transmitting summarized results through Low-Earth-Orbit (LEO) satellite or Global System for Mobile Communications (GSM) networks. We deployed SPARROW across tropical, temperate, and montane ecosystems in Colombia, Peru, Tanzania, and the United States, where it sustained 24/7 operation under variable environmental conditions and collected more than two million images and acoustic recordings in the first 190 days. The system demonstrated robust real-time classification and adaptive power management, achieving full autonomy without on-site human intervention. By integrating renewable energy, on-edge AI, and open-source design, SPARROW lowers the technical and financial barriers to ecological monitoring and establishes a scalable foundation for a distributed, intelligent network of sensors, an emerging “Internet of Living Things” for planetary biodiversity monitoring. Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.00108 [eess.SP] (or arXiv:2606.00108v1 [eess.SP] for this version) https://doi.org/10.48550/arXiv.2606.00108 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-327] Motif-based morphology signatures for interpretable ECG screening and monitoring
链接: https://arxiv.org/abs/2606.00107
作者: Nivedita Bijlani,Mauricio Villarroel
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026
Abstract:Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical practice relies on brief resting ECGs and, when required, long-duration ambulatory recordings, both generating data that require resource-intensive review. Consequently, subtle morphological changes or progressive drift preceding clinically apparent abnormalities may go unnoticed. We propose a motif-based framework that defines beat-aligned ECG motifs as interpretable cardiac signatures and quantifies morphological drift and deviation across short and long-term monitoring. Motifs are representative cardiac cycles capturing dominant morphology. We introduce three interpretable drift metrics: deviation from a normal sinus rhythm (NSR), deviation from a personalised baseline, and a motif instability index. Motifs are extracted by selecting beats that minimise Dynamic Time Warping (DTW) distance within fixed windows. We evaluate these metrics on short (PTB-XL) and long-duration (MIT-BIH Arrhythmia) ECG datasets. Interpretability is achieved through representative motif overlays and fiducial-based visualisations, enabling direct inspection of morphological changes. In MIT-BIH, the proposed metrics significantly separated predominantly normal from arrhythmic subjects (p0.01). In PTB-XL, NSR deviation distinguished normal from abnormal ECGs across major diagnostic subtypes (p1e-4, Cliff’s delta up to 0.93). ECG motifs provide an interpretable representation of cardiac morphology, supporting scalable longitudinal monitoring and early detection of morphology-driven change.
[AI-328] CLSP-REQA: A Real-Time Quality-Aware Closed-Loop Seizure Prediction Framework with Mamba-BiLSTM and Confidence-Gated Intervention
链接: https://arxiv.org/abs/2606.00074
作者: Mufeng Chen,Qi Wu,Bingchao Huang,Xiwen Lai,Zekai Chen,Xinge Ouyang,Quansheng Ren
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 27 pages, 8 figures, submitted to Biomedical Signal Processing and Control
Abstract:Reliable seizure prediction is a prerequisite for closed-loop neurostimulation therapy, yet existing methods rarely account for the variability in EEG signal quality encountered in real-world deployment, and the overwhelming majority adopt non-strict evaluation protocols that overestimate generalisation performance. We propose CLSP-REQA (Closed-Loop Seizure Prediction with Real-time EEG Quality Assessment), a unified framework that embeds a lightweight signal quality estimator directly within the prediction pipeline. A Real-time EEG Quality Assessment (REQA) module runs in parallel with a Mamba-BiLSTM backbone, producing a scalar quality score q in [0,1] that modulates output confidence through a tiered non-linear fusion function (ECLO). Under strict cross-patient evaluation on the CHB-MIT Scalp EEG Database (n = 23 subjects, 198 seizures), CLSP-REQA achieves an AUC-ROC of 0.7426 ± 0.0199, outperforming the unadapted cross-patient baseline of 0.69 reported by Jemal et al., using only 16 EEG channels compared to 23 in prior work, and without requiring any target-patient data or domain adaptation. On the SIENA Scalp EEG Database (n = 14 subjects, 47 seizures), CLSP-REQA achieves AUC 0.7012 ± 0.0249, substantially surpassing the best domain-adapted cross-patient result of 0.61 on the same dataset, demonstrating strong cross-dataset generalisation. The framework outputs a structured four-tuple (p, q, c, Phi_SHAP) directly compatible with closed-loop neurostimulator interfaces.
机器学习
[LG-0] IntraShuffler: A Privacy Preserving Framework for Heterogeneous DP Federated Learning
链接: https://arxiv.org/abs/2606.02563
作者: Farhin Farhad Riya,Olivera Kotevska,Jinyuan Stella Sun,
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Heterogeneous Differential Privacy (HDP) in Federated Learning (FL) allows clients to select individual privacy budgets ( \varepsilon_i ) according to institutional policies and data sensitivity. In practice, many HDP-FL systems employ \varepsilon -aware server aggregation to improve model utility by re-weighting client updates according to their declared privacy budgets. However, gradient updates in FL retain structural patterns induced by non-independent and identically-distributed (non-IID) data, and these additional signals exposed by \varepsilon -aware aggregation create new opportunities for inference by an honest-but-curious server. In this work, we first show that a server equipped with gradient denoising and surrogate modeling can mount a \emphPrivacy Inference Attack that infers distributional attributes of clients and links updates from the same client across training rounds, measured via surrogate inference accuracy and linkage success, under realistic knowledge constraints. The Shuffle-Model has been widely studied as a defense against such inference risks by anonymizing update sources, but it is fundamentally incompatible with HDP-FL \varepsilon -aware aggregation. To address this challenge, we propose \textbfIntraShuffler, a middleware defense framework designed for HDP-FL systems. IntraShuffler introduces a privacy-aware shuffling mechanism that groups clients into privacy-compatible buckets and performs parameter-level shuffling within each bucket to disrupt persistent gradient structure while preserving \varepsilon -aware aggregation. Experiments across four different datasets show that IntraShuffler reduces gradient recoverability by over 60% and decreases surrogate inference accuracy from 0.78 to 0.33 while maintaining comparable model utility across multiple FL aggregation rules. Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC) Cite as: arXiv:2606.02563 [cs.LG] (or arXiv:2606.02563v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.02563 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-1] A Biconvex Formulation for Stable Transport of Mixture Models with a Unique Solution
链接: https://arxiv.org/abs/2606.02515
作者: Yeganeh Marghi,Kelly Jin,Uygar Sümbül
类目: Machine Learning (cs.LG)
*备注:
Abstract:Optimal transport (OT) provides a principled framework for mapping between probability distributions. Despite extensive progress, applying OT to large-scale data remains computationally demanding, and the resulting pointwise transport plans are often difficult to interpret. We introduce Optimal Mixture Transport (OMT), a scalable framework that shifts the transport paradigm from individual samples to mixtures of subpopulations, reformulating the transport problem as a strictly biconvex optimization with a unique global minimizer. We further establish theoretical guarantees on the stability of the OMT map, showing that bounded perturbations of the underlying distributions lead to bounded changes in the transport plan. By formulating subpopulations as exponential-family distributions, OMT decouples computational complexity from the sample size, scaling solely with the number of mixture components. We demonstrate the effectiveness and practicality of OMT on a wide range of synthetic benchmarks and real-world datasets, including image data and large-scale single-cell RNA sequencing measurements.
[LG-2] Expressivity of congruence-based architectures for DNNs on positive-definite matrices
链接: https://arxiv.org/abs/2606.02490
作者: Antonin Oswald,Estelle Massart
类目: Machine Learning (cs.LG)
*备注: Accepted for Eusipco 2026
Abstract:This work studies neural architectures for classifying symmetric positive-definite matrices, focusing on congruence-like layers, in which the input matrix is multiplied on the left and right by a (possibly rectangular) weight matrix W and its transpose. Such layers lie at the core of the celebrated SPDNet and have also been employed independently for dimensionality reduction on positive-definite data. We show that the (semi)-orthogonality constraint commonly imposed on W limits the expressivity of these layers: for certain activation functions, the resulting architecture collapses to a one-hidden-layer equivalent. This lack of expressivity follows from a loss of spectral diversity in congruence-like layers for semi-orthogonal W and is a direct consequence of Poincaré’s separation theorem. We then examine the choice of the final classifier, comparing several Riemannian classifiers and discussing their compatibility with the feature maps produced by congruence-like layers.
[LG-3] Physics-Informed Residuals for Adaptive Mesh Refinement in Finite-Difference PDE Solvers
链接: https://arxiv.org/abs/2606.02475
作者: Henry Kasumba,Ronald Katende
类目: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 17 pages, 5 tables, 5 figures
Abstract:Classical finite-difference solvers remain reliable tools for partial differential equations, but their efficiency depends on where mesh resolution is placed. Uniform refinement can waste degrees of freedom when solution difficulty is localised near sharp gradients, fronts, oscillations, or constraint-sensitive regions. This paper studies a hybrid strategy in which a physics-informed neural network (PINN) is used not as the final solver, but as an off-grid residual probe for adaptive mesh refinement. The PINN residual is sampled over the domain, converted into cellwise indicators, and used to guide refinement before the final approximation is computed by a finite-difference solver. The method is evaluated on three benchmarks. The main full-solver validation uses the one-dimensional viscous Burgers equation with a nonuniform finite-difference solve on the adapted meshes. PINN-threshold refinement attains final relative L^2 error 0.021067 with 60 degrees of freedom, compared with 0.022617 for uniform refinement with 192 degrees of freedom. At matched mesh size, PINN-threshold reduces the error by about 67.5% . PINN-D"orfler refinement gives similar performance, with error 0.021264 using 58 degrees of freedom. A gradient indicator remains slightly more accurate, so the result supports usefulness rather than universal superiority. Manufactured 2D and 3D proxy tests, based on a nonlinear Schr"odinger equation and an incompressible Navier–Stokes system, show that PINN residuals can organise structured refinement and improve over random refinement, although they do not consistently outperform gradient or uniform baselines. The results support PINN-guided AMR as a residual-indicator strategy for transferring physics-informed diagnostic information into finite-difference mesh adaptation while preserving the classical solver as the final approximation engine. Comments: 17 pages, 5 tables, 5 figures Subjects: Numerical Analysis (math.NA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG) MSC classes: 65M06, 65M22, 65M50, 65M60, 35Q30, 35Q55, 76D05 Cite as: arXiv:2606.02475 [math.NA] (or arXiv:2606.02475v1 [math.NA] for this version) https://doi.org/10.48550/arXiv.2606.02475 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-4] Speculative Sampling For Faster Molecular Dynamics ICML2026
链接: https://arxiv.org/abs/2606.02455
作者: Arthur Kosmala,Stephan Günnemann,Meng Gao,Brandon Wood
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Computation (stat.CO)
*备注: Forty-Third International Conference on Machine Learning (ICML 2026). 32 pages, 14 figures, 8 tables
Abstract:Molecular dynamics (MD) is a key tool for simulating the dynamical behavior of atomic systems. However, MD is inherently serial, which makes it difficult to increase single-system throughput with concurrent compute. To address this, we introduce Langevin Speculative Dynamics (LSD), a distributed and model-agnostic speculative sampler for accelerating MD without adding relative error. Inspired by speculative methods in language and diffusion modeling, LSD uses a draft model to propose fast simulation steps and verifies them in parallel with a slower target model, applying a transport map from the draft to the target distribution. We extend speculative sampling to second-order Langevin dynamics, derive the achievable speedup as a function of physical parameters, show that LSD generalizes across different systems and draft-target combinations with a 3-9x speedup, and confirm theoretically and empirically that LSD samples trajectories from its target model distribution.
[LG-5] Spectral Audit of In-Context Operator Networks
链接: https://arxiv.org/abs/2606.02427
作者: Zhiwei Gao,Liu Yang,George Em Karniadakis
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Existing evaluations of neural operators and in-context operator learning rely primarily on prediction error, but accurate output prediction does not guarantee the correct local dynamical structure. A model may match solutions while exhibiting incorrect sensitivities, distorted frequency response, spurious mode coupling, or unstable tangent behavior. We introduce a Jacobian-based spectral audit for in-context operator learning. For a fixed prompt, we differentiate the network output with respect to the query function and view the resulting Jacobian as a learned tangent operator. Projecting it onto Fourier modes, we obtain a local spectral characterization of the inferred operator, including frequency-dependent gains, phase structure, and cross-mode coupling. The audit complements standard prediction metrics by testing whether the model reproduces local mechanisms of the underlying PDE operator rather than only outputs. Across benchmarks, the audit reveals distinct operator-level phenomena, including phase transport, viscosity-dependent damping, nonlinear mode coupling, and reaction–diffusion stability structure. It also detects failures partially hidden by prediction-error metrics, including high-frequency degradation, incorrect phase recovery, and prompt–operator inconsistencies. Corrupted or internally inconsistent prompts lead to degraded tangent-operator structure even when pointwise predictions remain partially accurate. Our results suggest that prediction accuracy and local operator fidelity are distinct properties of learned neural operators. Our framework also provides a diagnostic for stability, sensitivity, and operator consistency.
[LG-6] abPrep: Closing the Feature Engineering Gap in Tabular Benchmarks
链接: https://arxiv.org/abs/2606.02384
作者: Andrej Tschalzev,Nick Erickson,Yuyang Wang,Huzefa Rangwala,Stefan Lüdtke,Heiner Stuckenschmidt,Christian Bartelt
类目: Machine Learning (cs.LG)
*备注:
Abstract:Progress in tabular machine learning has largely focused on increasingly sophisticated model architectures. At the same time, feature engineering remains a critical yet underexplored component of real-world modeling pipelines that is entirely absent from modern benchmarks, which creates an unquantified evaluation gap. In this work, we introduce TabPrep, a lightweight preprocessing pipeline composed of feature generators that are carefully designed to target three specific structural data patterns. We show that many widely used model classes exhibit predictable blind spots to these patterns and that systematic feature engineering alone can establish new peak performance. Across the TabArena benchmark, integrating TabPrep into model training and tuning consistently improves performance for tree-based, neural, linear, and foundation models, often surpassing gains achieved by model-centric innovations alone. TabPrep outperforms previous automated feature engineering approaches in performance, efficiency, and applicability across datasets, enabling integration into large-scale benchmarks. By releasing TabPrep (see this https URL), we enable researchers to integrate feature engineering into their benchmarking setup, filling a longstanding gap in tabular evaluations.
[LG-7] Minimax-Optimal Policy Regret in Partially Observable Markov Games
链接: https://arxiv.org/abs/2606.02363
作者: Raman Arora
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner’s strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves \tildeO(\sqrtT) policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in T . We also prove a lower bound matching the \sqrtT and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.
[LG-8] Local Preferential Bayesian Optimization
链接: https://arxiv.org/abs/2606.02351
作者: Johanna Menn,Miriam Kober,Paul Brunzema,David Stenger,Sebastian Trimpe
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Bayesian optimization (BO) is a popular and effective approach for tuning expensive, noisy experiments, but requires the formulation of an explicit objective function. Preferential BO (PBO) removes this requirement by learning from pairwise human feedback, yet existing methods struggle to efficiently optimize beyond low- and medium-dimensional problems due to their global search approaches. We address this limitation by developing a family of local PBO methods that transfer key ideas from high-dimensional BO to the preferential setting. In particular, we introduce local PBO methods which adapt trust-region and derivative-informed local search to pairwise preference feedback, where the latter exploits first- and second-order derivatives of the Laplace-approximated GP posterior. Our benchmark on GP sample paths, standard optimization benchmark functions, and policy-search tasks shows that local PBO methods are especially effective in high-dimensional and complex landscapes with steep optima. Compared with global preference-based baselines, they can substantially reduce cumulative regret, making them particularly useful for real-world preference-based optimization tasks such as policy search.
[LG-9] Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification
链接: https://arxiv.org/abs/2606.02341
作者: Amirmohammad Mohammadi,Joshua Peeples,Alexandra Van Dine
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: 9 pages, 7 figures
Abstract:Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling a domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network’s representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
[LG-10] Riemannian Gradient Descent for Low-Rank Architectures
链接: https://arxiv.org/abs/2606.02328
作者: Nicholas Knight
类目: Machine Learning (cs.LG)
*备注:
Abstract:We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank- r matrices, three geometries for rank- r partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.
[LG-11] Regularized Large Neighborhood Search
链接: https://arxiv.org/abs/2606.02294
作者: Germain Vivier-Ardisson,Laurent Demonet,Axel Parmentier,Mathieu Blondel
类目: Machine Learning (cs.LG)
*备注:
Abstract:Operations research practitioners typically tackle NP-hard combinatorial problems using large neighborhood search (LNS), a scalable heuristic that iteratively refines a current solution by locally re-optimizing subsets of its variables. In contrast, most existing approaches for integrating combinatorial optimization layers into neural networks still assume access to an exact global solution, which is computationally intractable. We bridge this gap by introducing regularized LNS (RLNS). By regularizing or perturbing local subproblems, we turn the LNS heuristic into an efficient MCMC sampler over the combinatorial set of feasible solutions, with associated Fenchel-Young losses. Under entropic regularization, we prove that RLNS performs exact block Gibbs sampling. Furthermore, adjusting the number of RLNS iterations allows us to interpolate between pseudolikelihood and exact maximum likelihood estimation, for end-to-end learning without global solvers. We demonstrate our approach on k -subset selection, generalized assignment, and stochastic vehicle scheduling problems.
[LG-12] Massive Spikes in LLM s are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization
链接: https://arxiv.org/abs/2606.02288
作者: Yung-Chin Chen,Chung Peng Lee,Ze-Wei Liou,Naveen Verma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Massive activation spikes in Large Language Models (LLMs) severely degrade quantization by stretching dynamic ranges. While prior hypotheses characterize these as high-level scalar biases, we argue that they are merely the scalar intermediates of rigid, structural vector biases in the spike-carrying tokens. We show that these tokens converge to constant vectors after normalization that drive the attention sink and value-state drain mechanisms. We geometrically substantiate this by analyzing the coordination of projection weights: W_K contrastively amplifies the vector, W_Q aligns semantic tokens toward it, and W_V projects it into the spectral null-space. Furthermore, we reveal that the model actively preserves these structural biases against Rotary Positional Embedding (RoPE) perturbations by localizing them in “zones of rotational stability” utilizing low-frequency bands and coherent channel pairs. Leveraging this, we propose INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function via pre-computed template vectors. This renders activations strictly spike-free, enabling robust low-bit quantization with high fidelity. INSERTQUANT achieves parity with state-of-the-art per-tensor quantization methods on LLMs and uniquely generalizes beyond text to other modalities such as ViTs.
[LG-13] Physics-Guided Recurrent State-Space Neural Networks for Multi-Step Prediction
链接: https://arxiv.org/abs/2606.02278
作者: Ruiyuan Li,Ajay Seth,Manon Kok
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures. Accepted at IFAC World Congress 2026
Abstract:State-space models are traditionally based on physical knowledge, but multi-step predictions from these physical models can be poor due to model inaccuracy. Black-box deep learning has shown promise as an alternative. However, these methods rely on the availability of large datasets and potentially available physical knowledge is neglected. We propose the PG-RSSNN, a physics-guided recurrent state-space neural network that incorporates recurrent structures to enable the use of non-saturating activation functions in multi-step prediction. It mitigates the vanishing gradients and eliminates the risk of numerical divergence in training seen in existing structures that feed back state estimates. Results across multiple systems with various physical model imperfections, from linear state-space models with Gaussian noise to a robotic arm and a cascaded water tank system, show that the proposed PG-RSSNN maintains stable training behavior, and improves multi-step predictions, as compared with black-box neural networks and physics-only models, even with limited training data and when physical models are only partially known.
[LG-14] ArrythML: An Autoencoder-Based TinyML Approach for On-Device Arrhythmia Detection on Resource-Constrained Embedded Systems
链接: https://arxiv.org/abs/2606.02256
作者: Nagarajan S,Kurian Polachan
类目: Machine Learning (cs.LG)
*备注: 19 pages,
Abstract:Our work presents a method for ECG segmentation and arrhythmia detection using Tiny Machine Learning (TinyML) models for real-time, on-device inference on resource-constrained embedded systems. We develop INT8 quantized autoencoder-based TinyML models with minimal layers and parameters for embedded deployment. These models are evaluated using a custom dataset derived from the MIT-BIH Arrhythmia Database and validated in both PC-based simulations and on-device environments. For the evaluations, over 95,000 ECG segments are processed on an ESP32-S3 microcontroller running the TensorFlow Lite Micro runtime. Post-evaluation, detailed analysis, including annotation-wise and record-wise failure analysis, is conducted to characterize model behavior across diverse ECG morphologies and rhythm patterns and to explain missed detections. In several cases, apparent misclassifications may correspond to early or subtle anomaly patterns labeled as normal in the reference annotations, highlighting the model’s sensitivity. A refined evaluation by filtering out ambiguous cases in the dataset shows that the best-performing DNN-based autoencoder achieves a recall of 84%, an F1-score of 79%, a model size of approximately 180 KB, and an inference latency of 9 ms on-device. These results demonstrate the feasibility of low-power, privacy-preserving embedded wearable systems capable of performing accurate arrhythmia detection entirely on-device. Comments: 19 pages, Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.02256 [cs.LG] (or arXiv:2606.02256v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.02256 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-15] BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers
链接: https://arxiv.org/abs/2606.02241
作者: Justin Deschenaux,Caglar Gulcehre
类目: Machine Learning (cs.LG)
*备注:
Abstract:Is the uniform-state diffusion framework a more powerful paradigm for discrete diffusion? Recent studies indicate that this may be the case. In combination with predictor-corrector samplers, uniform-state diffusion models (USDMs) produce samples of higher-quality than masked diffusion models (MDMs), and USDMs equal or outperform MDMs in downstream tasks, even though they exhibit greater perplexity. Two issues remain unresolved. First, existing work compares uniform and masked diffusion with un-informed correctors that re-inject noise at random positions, rather than targeting tokens most likely to be wrong. Second, prior work compares full-sequence diffusion models, so we do not know whether the same conclusion holds when tokens are generated block by block. To address these issues, we introduce BlockGen, a blockwise sequence model that we instantiate with both masked and uniform diffusion. BlockGen trains on a mixture of block sizes and its likelihood interpolates between AR and pure diffusion more finely than models with a fixed block size. BlockGen enables AR-informed predictor-corrector sampling (ARPC), which combines AR and diffusion predictions to re-generate unlikely tokens without an auxiliary verifier. Under ancestral sampling, uniform outperforms masked in the block-by-block setting, especially in the few-step regime. Under ARPC, the gap closes and reverses at high NFE. With block size 16 on GSM8K, MDMs reach slightly higher accuracy than USDMs, and we observe a similar trend in Generative Perplexity on OpenWebText. Find our code at this https URL.
[LG-16] Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation
链接: https://arxiv.org/abs/2606.02237
作者: Shucheng Li,Iolo Jones,Alexander Tong,Michael M. Bronstein
类目: Machine Learning (cs.LG)
*备注:
Abstract:Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning their noised distributions across all scales. In principle, such distribution-level supervision remains agnostic to specific noise-data pairings of the teacher; this provides the student the freedom to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the original noise-data pairings of the teacher, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of adversarial objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.
[LG-17] A Doeblin-Anchored Contrastive Chart for Learning Markov Transition Kernels
链接: https://arxiv.org/abs/2606.02232
作者: Ao Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Learning a Markov transition model is not merely conditional density estimation: the learned object must be a valid transition kernel before it is iterated in downstream dynamics. This paper introduces a Doeblin-anchored contrastive chart, a statistical-to-dynamical coordinate framework for learning transition kernels from contrastive objectives. Given a restart law and an anchor strength, the chart mixes the target transition with the restart law. The resulting anchored kernel is simultaneously a Doeblin-minorized Markov kernel, the positive conditional law in a binary contrastive experiment, and an explicitly invertible coordinate for the original transition law. We prove that the anchored contrastive risk identifies the anchored transition density and calibrates excess risk to density error. Since inversion of a learned score may produce a signed or unnormalized object, we introduce a measurable Markovization operator that restores kernel validity while preserving integrated L^1 accuracy up to a constant factor. Oracle inequalities and Hölder–ReLU approximation bounds yield nonparametric rates for independent transition pairs. For stationary geometrically \beta -mixing trajectories, a conservative thinning-and-coupling extension yields the same reconstruction interface with an effective sample size. Occupancy-weighted perturbation bounds transfer one-step kernel error to finite-horizon marginal, path-law, and occupation-measure errors under explicit coverage.
[LG-18] Network Learning with Semi-relaxed Gromov-Wasserstein
链接: https://arxiv.org/abs/2606.02223
作者: Charles Dufour,Ulysse Naepels,Leonardo V. Santoro
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:Estimating the generative mechanism of large-scale networks is a fundamental challenge in statistical machine learning. It requires the identification of the latent connectivity structure, which is in general an NP-hard combinatorial problem due to the absence of canonical node labels. We address this challenge by allowing for probabilistic couplings, thereby relaxing the assignment problem. Our estimation framework can be formulated as a semi-relaxed Gromov-Wasserstein objective and provides a low-dimensional representation of the generative structure. We solve this via a block-coordinate conditional gradient algorithm. Despite the relaxation, the resulting solution is typically deterministic: in fact, we show that the optimality gap between the relaxed solution and the deterministic assignment vanishes at rate O(1/n) , where n is the number of nodes. This allows for tractable recovery of the underlying model and enables rigorous statistical analysis: we establish consistency and minimax-optimal convergence rates for both stochastic block models and Holder-smooth graphons. Our implementation scales efficiently with n , as demonstrated on both synthetic and real-world datasets.
[LG-19] Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment
链接: https://arxiv.org/abs/2606.02198
作者: Ashwin Singh,Carlos Castillo
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注: 17 pages, 12 figures
Abstract:Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity, by first deriving a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then by evaluating the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness. Comments: 17 pages, 12 figures Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY) Cite as: arXiv:2606.02198 [cs.LG] (or arXiv:2606.02198v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.02198 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-20] Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards
链接: https://arxiv.org/abs/2606.02194
作者: Christian Scherer,Joe Watson,Theo Gruner,Daniel Palenicek,Ingmar Posner,Jan Peters
类目: Machine Learning (cs.LG)
*备注: 13 pages, 7 figures
Abstract:Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning, where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy through using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a \geq 90% success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.
[LG-21] he Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing
链接: https://arxiv.org/abs/2606.02184
作者: Michał Brzozowski,Neo Christopher Chung
类目: Digital Libraries (cs.DL); Machine Learning (cs.LG)
*备注:
Abstract:These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates: server-side DataCite timestamps prove deliberate backdating, and 991 records were registered in a single month; these carry real DOIs registered in DataCite, making them harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate forming synthetic research groups with collaborators drawn from multiple model families; publication dates on these records provide a reliable temporal proxy for model deployment windows.
[LG-22] Low-Pass Flow Matching ICLR2026
链接: https://arxiv.org/abs/2606.02177
作者: Francesco M. Ruscio,T. Konstantin Rusch
类目: Machine Learning (cs.LG)
*备注: ICLR 2026 Delta Workshop
Abstract:Flow Matching typically relies on white noise sources, a choice often misaligned with the power spectra of natural data, which tend to decay with frequency. To address this, we introduce Low-Pass Flow Matching, a variant of Flow Matching based on an operator-modulated interpolant. This formulation induces a time-varying spectral bias that transitions from the source spectrum to a frequency-decaying bias as the path approaches the data. We validate our method on unconditional image generation tasks, including the scientific Galaxy10 dataset. Empirically, we show that our method is particularly effective when paired with adaptive ODE solvers, where it improves or preserves sample quality while substantially reducing sampling cost compared to standard baselines.
[LG-23] EEG-FuseFormer: A Transformer-Driven Feature Fusion Framework for Seizure Onset Prediction
链接: https://arxiv.org/abs/2606.02166
作者: Vigneshwar Hariharan(1),Chithra Reghuvaran(2),Arlene John(3),Nhat Pham(4),Omer Rana(4),Deepu John(2),Ganesh Neelakanta Iyer(1) ((1) National University of Singapore, (2) University College Dublin, (3) University of Twente, (4) Cardiff University)
类目: Machine Learning (cs.LG)
*备注: IEEE International Instrumentation and Measurement Technology Conference (I2MTC) 2026
Abstract:Epilepsy is one of the most common neurological disorders globally, characterized by recurring seizures and significantly impacting the quality of life. Despite advancements in diagnostic techniques, the mitigation of risks faced by epilepsy patients remains challenging due to the unpredictability of seizure events. An accurate forecast of seizure onset helps to reduce risks in epilepsy patients. In this paper, we propose EEG-FuseFormer, a transformer-based feature fusion framework for seizure-onset prediction that combines intermediate features extracted from Convolutional Neural Networks-Long Short-Term Memory (CNN-LSTM) and ResNet-18 networks. The CNN-LSTM architecture captures both spatial and temporal features directly from the raw signal, whereas the ResNet-18 extracts features from the Short-Time Fourier Transform (STFT) representation of the EEG signals. Fusion is carried out using a transformer encoder, and the final prediction is generated using fully connected dense layers. The CHB-MIT dataset was used to validate the proposed model. The results show that the proposed model achieves a mean recall of 98.85% and outperforms most of the state-of-the-art methods. This study evaluates the ability of the proposed feature fusion model to generalize in cross-patient testing scenarios. Fine-tuning pre-trained models on limited target patient data (target adaptation) within the cross-patient validation framework results in higher recall, precision, and F1-score metrics in comparison to the conventional cross-patient validation approach. Finally, the runtime-based computational complexity of the model is assessed across diverse hardware platforms to highlight the performance-complexity trade-off.
[LG-24] Hybrid Neural Ordinary Differential Equations for Data-Efficient Polymerization Modeling with Incomplete Kinetics
链接: https://arxiv.org/abs/2606.02145
作者: Marah Almanasreh,Alexander Mitsos,Eike Cramer
类目: Machine Learning (cs.LG)
*备注: 25 pages, 5 figures
Abstract:Accurate prediction of polymerization dynamics is essential for process design, control, and optimization. Yet, purely mechanistic models require labor-intensive parameterization of partially characterized kinetics, while purely data-driven models demand large, diverse datasets that are costly to obtain, particularly in early-design stages. We propose a hybrid Neural Ordinary Differential Equation (NODE) framework for data-efficient modeling of free-radical polymerization. Using batch polymerization of methyl methacrylate (MMA) as a case study, the mechanistic mass balances are retained explicitly, and only the partially-characterized effective radical concentration governing monomer consumption is learned from data through a neural network surrogate, while established reactions such as initiator decomposition, propagation, and termination remain physically modeled. The hybrid NODE is evaluated against a discrete-time feedforward neural network and a purely data-driven NODE under sparse data conditions, with models trained on as few as ten measurements under both regular and irregular sampling. The hybrid NODE consistently achieves lower prediction errors and more physically consistent extrapolations than both purely data-driven baselines. In a generalization scenario with noisy data and unseen operating conditions, the hybrid NODE achieves an RMSE of 0.013, compared to 0.31 for the data-driven NODE and 0.68 for the discrete-time model, demonstrating that learning only a closure term rather than the full dynamics is sufficient for reliable prediction under limited data availability.
[LG-25] meBlocks: Foundational and Continual Time-Series Blockbase – Extended Version KDD2026
链接: https://arxiv.org/abs/2606.02142
作者: David Campos,Bin Yang,Tung Kieu,Lei Chen,Chenjuan Guo,Christian S. Jensen
类目: Machine Learning (cs.LG); Databases (cs.DB)
*备注: 15 pages. An extended version of “TimeBlocks: Versatile and Continual Time-Series Blockbase” accepted at SIGKDD 2026
Abstract:The ongoing digitization has led to a proliferation of time-series data streams that monitor a variety of processes, from which valuable insights may be obtained. Further, the emergence of successful foundational language models begs the question of whether it is possible to achieve time-series models with the foundational properties of handling multiple tasks, while being sufficiently lightweight to allow real-time data stream processing. Existing foundational time-series models are often large and only effective in offline settings without stringent time and computational constraints, and where repeated model calibration is not needed. However, when applied to data streams, these models are ineffective due to their size and lack of support for continual calibration, which compromise their ability to deliver accurate real-time responses, their durability, and their deployability in hardware-limited settings. We propose TimeBlocks to enable versatile time-series processing by facilitating the efficient building of lightweight models suitable for multiple tasks under variable conditions. In particular, the method maintains a pool of interchangeable and modular model blocks that can be used to construct new time-series models. When presented with specific time-series data, a routing strategy iteratively selects the most suitable blocks to construct a lightweight and accurate model for the data. We equip TimeBlocks with a method called StreamCore to build a representative small subset of the data stream, which preserves a guaranteed approximation of the stream over time, enabling continual model calibration. An experimental study on multiple data sets and covering multiple tasks shows that TimeBlocks enables to build models capable of outperforming existing baselines.
[LG-26] Edge-aware Decoding for Neural Asymmetric Routing
链接: https://arxiv.org/abs/2606.02136
作者: Li Liang,Jinbiao Chen,Zizhen Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural asymmetric routing models increasingly encode directionality through matrix representations and asymmetry-aware attention. The final routing action, however, is not a node in isolation but a directed transition chosen under the current partial route. This creates a representation–decision mismatch: pairwise cost information may be encoded upstream while the final candidate logit is still largely parameterized as context–node compatibility. We propose a decoder-design principle for neural asymmetric routing: the final score should explicitly expose transition-level quantities suggested by the problem’s cost-to-go structure. We instantiate this principle with an edge-aware decoder that adds candidate-specific terms for the current directed edge, return-to-start closure, and static lightweight lookahead, while keeping the representation backbone fixed. On a controlled SVD/Sinkhorn asymmetric backbone, the decoder improves over the RADAR reference when trained on ATSP-100 and evaluated zero-shot on ATSP-100/200/500/1000, reducing the ATSP-1000 gap from 4.13% to 2.73% . On ACVRP, the same score-level modification shows the same qualitative trend under a richer routing state. ATSP ablations and directed-transition diagnostics sharpen the mechanism: the strongest evidence concerns sensitivity to the current directed edge, while closure and static lookahead act as heuristic continuation cues. The results support a mechanism study: a key decoder-side signal in neural asymmetric routing is decision-time exposure of transition-level edge information.
[LG-27] When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets 7 Modalities and Two Regimes
链接: https://arxiv.org/abs/2606.02106
作者: Julien Lafrance
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 5 figures. Code and data available at this https URL
Abstract:We present a single classification pipeline that combines an Equiangular Tight Frame (ETF) preprocessing stage with a tabular foundation model for in-context inference, applied identically across modalities once data is mapped to fixed vector representations. We evaluate it on 95 datasets spanning seven signal modalities – vision, audio, speech, text, molecular, time-series, and tabular. The main methodological contribution is to fix the comparison object: throughout the paper, performance is judged against the strongest lightweight tuned baseline on the same frozen features, while oracle selection, deployed selection, and specialized fine-tuning are reported separately. The pipeline is broadly competitive with strong lightweight tuned baselines on the same frozen features. It does not match the very best specialized models or heavily tuned pipelines on every task, but it stays close, and it runs much faster – typically 4 to 200 times faster than full backbone fine-tuning, often at comparable quality. We describe how to deploy the pipeline in practice: when to apply ETF preprocessing, how to stop its training without a validation split, how to set up the in-context classifier, and how to calibrate the resulting probabilities. The calibration step is non-cosmetic: TabICL produces well-calibrated probabilities by construction, ETF preprocessing initially disrupts that calibration, and the post-hoc rescaling restores it – yielding a per-prediction confidence signal that practitioners can use as a trust threshold for confidence-gated deployment. We also report where the pipeline should not be expected to help, and how to identify those cases in advance. Comments: 24 pages, 5 figures. Code and data available at this https URL Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2606.02106 [cs.LG] (or arXiv:2606.02106v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.02106 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-28] Beyond ell_2-norm and ell_infty-norm: A Curvature-Inspired ell_p-Norm Scheme for Deep Neural Networks
链接: https://arxiv.org/abs/2606.02078
作者: Jianhao Xu,Zhuang Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:The existing optimizers for deep neural networks (DNNs) typically rely on either the \ell_2 norm or the \ell_\infty norm, resulting in optimizers that do not adapt well to substantial changes in curvature across parameter dimensions. Generally, the training process of DNNs often exhibits strong curvature anisotropy in the early period, whereas in the later period, the training process of DNNs tends to move toward flatter regions with weaker anisotropy. Particularly, optimizers based on the (\ell_2)-norm are usually dominated by high-curvature directions, restricting updates of optimizers along with lower curvature direction and thus leading to a slower convergence rate. While optimizers based on the (\ell_\infty)-norm are prone to oscillations in flatter regions, due to the coordinate-wise updates of the same magnitude. To address these two extreme cases generated by \ell_2 and \ell_\infty norms, we propose a novel \ell_p -norm scheme with a dynamical value of p and incorporate it into stochastic gradient descent (SGD) and SGD with momentum (SGDM), leading to two novel optimizers with better generalization performance: \ell_p -SGD (LPSGD) and \ell_p -SGDM (LPSGDM). Particularly, the resulting optimizers suppress the dominance of high-curvature directions in the early period by utilizing a large p ( p2 ), followed by a gradual decrease of p toward 2 to enable more stable and refined updates, where the latter process is motivated by the cosine annealing strategy. We establish theoretical guarantees of the resulting algorithms and analyze that both LPSGD and LPSGDM achieve an (O(T^-1/2)) convergence rate for the nonconvex setting. Extensive experiments are conducted on benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, with multiple DNNs such as VGG-11, ResNet-18, and ResNet-50.
[LG-29] Planar Symmetric Pattern Generation
链接: https://arxiv.org/abs/2606.02073
作者: Ning Lin,Luxi Chen,Huaguan Chen,Jiacheng Cen,Chongxuan Li,Wenbing Huang,Hao Sun
类目: Machine Learning (cs.LG)
*备注:
Abstract:Generating objects with specific symmetries is essential in various real-world scenarios. However, adapting existing 2D continuous representations to enforce planar group symmetry remains a challenge, as the transformation of non-reflective group elements may disrupt continuity. To overcome this limitation, we propose a symmetrization framework for arbitrary planar groups. Our method transforms any 2D continuous representation into a symmetric one while preserving continuity. We provide the mathematical formulation of this representation, demonstrate its approximation capability for symmetric functions, and detail the construction methodology. We validate our approach through three visual design tasks (pattern design, paper-cutting design and stylized topology design) and one material design task. Experiments confirm that our representation enables effective symmetry control and demonstrate its broader applicability.
[LG-30] Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design
链接: https://arxiv.org/abs/2606.02061
作者: Michał Brzozowski,Neo Christopher Chung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduces polysemanticity. However, features from SAEs vary substantially across random seeds – a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and report more stable dictionaries at the end of training. We demonstrate that the stability claimed by archetypal SAEs is a result of setting identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be ambiguously used: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability of natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. Experiments from archetypal SAEs share a deterministic k-means decoder initialization, setting inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.
[LG-31] Query-Limited Community Recovery in Stochastic Block Models
链接: https://arxiv.org/abs/2606.02055
作者: Sabyasachi Basu,Manuj Mukherjee,Lutz Oettershagen,Suhas Thejaswi
类目: Information Theory (cs.IT); Machine Learning (cs.LG); Social and Information Networks (cs.SI); Machine Learning (stat.ML)
*备注:
Abstract:We study exact community recovery in the two-community stochastic block model on n vertices under limited and noisy access to network data. The learner may query a noisy neighborhood oracle that reveals each true neighbor of a queried vertex independently with fixed probability and never returns non-neighbors, subject to a finite query budget. We consider both oracle-only access and a combined model where the learner also observes a single subsampled copy of the underlying graph. For oracle-only access, balanced uniform querying gives a sharp non-adaptive benchmark: when each vertex is queried the same integer number of times, the observations reduce to an SBM with attenuated edge probabilities and the Abbe-Bandeira-Hall exact-recovery threshold applies. We show that this benchmark is not adaptively optimal: a two-stage adaptive strategy succeeds with n+o(n) queries in a regime where balanced uniform querying requires m n queries for some m1 . With an additional subsampled graph, we prove a sublinear-query adaptivity gap: balanced data-independent uniform querying with a sublinear budget does not improve over the subsampled graph alone, whereas adaptive querying can target a small set of uncertain vertices and achieve exact recovery. Thus adaptive data acquisition can strictly improve the information-theoretic limits of exact recovery.
[LG-32] Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning
链接: https://arxiv.org/abs/2606.02044
作者: Bradley G. Karat,Maëliss Jallais,Ali R. Khan,Santiago Aja-Fernández,Jelle Veraart,Marco Palombo
类目: Machine Learning (cs.LG); Medical Physics (physics.med-ph)
*备注:
Abstract:Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
[LG-33] Evaluating Real-World Generalizability of Algorithm Selection Models
链接: https://arxiv.org/abs/2606.02016
作者: Gjorgjina Cenikj,Jakub Kudela,Eva Tuba,Tome Eftimov
类目: Machine Learning (cs.LG)
*备注: 10 pages, 12 figures
Abstract:Algorithm Selection (AS) aims to automatically identify the most suitable optimization algorithm for a given problem instance by leveraging measurable problem characteristics and historical performance data. In this study, we investigate the generalization ability of AS models across both synthetic and real-world optimization landscapes. We consider two widely used academic benchmark suites (BBOB and CEC) and two real-world problem sets (robotics trajectory optimization tasks and unmanned aerial vehicle path-planning problems). Through a systematic cross-benchmark evaluation, we analyze how AS models transfer between domains, identify where generalization succeeds or breaks down, and highlight the challenges that arise when applying AS in realistic, domain-specific contexts. Our findings provide insights into the robustness of current AS approaches and inform the development of more reliable, broadly applicable AS systems for real-world optimization.
[LG-34] Graph Edit Distance Formulation for the Vehicle Routing Problem: Theory and Analysis
链接: https://arxiv.org/abs/2606.01987
作者: Adel Dabah
类目: Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注:
Abstract:We show that the Vehicle Routing Problem (VRP) can be reformulated as a Graph Edit Distance (GED) maximization problem. Under a simple edge-deletion cost model, minimizing total route cost is equivalent to maximizing the total weight of edges deleted from the complete instance graph. This formulation models VRP at the edge level, where solutions are defined by selected edges rather than route sequences, enabling structural analyses that are difficult in classical formulations: per-edge attribution of solution quality, decomposition of the optimality gap, characterization of solution sparsity, and identification of edges that are hard to reach by greedy construction. Theoretically, we establish a merge-decomposition theorem showing that Clarke-Wright savings equal per-merge GED increments, and an approximation-transfer theorem that turns GED approximation ratios into VRP cost bounds. Using this reformulation, we analyze 90 CVRP benchmark instances with known optimal solutions. We find that optimal routing graphs use only 5.5% of available edges, that approximately 3.0% of optimal edges are consistently not found by Clarke-Wright heuristics under repeated restarts, and that the cost gap decomposes into missed optimal edges and substituted non-optimal edges of comparable total weight. The edge-additive objective provides a natural per-edge supervision signal for future graph neural network approaches to edge prediction, suggesting a potential connection to graph neural network approaches that we leave for follow-up work.
[LG-35] Flow-Transformed Implicit Processes for Function-Space Variational Inference
链接: https://arxiv.org/abs/2606.01954
作者: Luis A. Ortega,Andrés R. Masegosa,Thomas D. Nielsen
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 24 pages, 4 figures, 10 tables. Pre-print submitted for revision
Abstract:Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box \alpha objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.
[LG-36] Randomized Least Squares Value Iteration itself is Joint Differentially Private
链接: https://arxiv.org/abs/2606.01952
作者: Haiyang Lu,Pratik Gajane,Shaojie Bai,Mohammad Sadegh Talebi
类目: Machine Learning (cs.LG)
*备注: 12 pages, 0 figures
Abstract:As reinforcement learning (RL) increasingly applies to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users’ sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we show a new privacy analysis that characterizes how the noise in RLSVI set for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI is (\varepsilon(\delta),\delta) -joint differentially private in tabular MDP as is with \varepsilon(\delta) = \frac2AKH^2\log(2HSA) + 2\sqrt\frac2AK\log(1/\delta)H^2\log(2HSA) , where S and A are the number of states and actions respectively, H is the length of an episode and K is the number of episodes.
[LG-37] MidSurfNet: Learnable Face Pairing and Interference Implicit Fields for Generalized Mid-surface Abstraction
链接: https://arxiv.org/abs/2606.01891
作者: Li Ye,Xinhang Zhou,Xingyu Yang,Ruofeng Tong,Hailong Li,Peng Du,Min Tang
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 20 pages, 12 figures, 5 tables
Abstract:Mid-surface abstraction is essential for finite element analysis of thin-walled CAD models. Existing face pairing-based methods rely on handcrafted geometric heuristics, yet real-world industrial models frequently exhibit multi-wall-thickness regions, self-matching face configurations, and demand for non-center offset surfaces–scenarios where rule-based approaches consistently fail. We present MidSurfNet, a learning-augmented framework that addresses these limitations through two novel components: (1) a neural face pairing module that learns to predict face pair confidence from geometric and topological features, handling complex pairing scenarios beyond rule-based methods; and (2) an interference implicit field that represents mid-surfaces as the interference of two signed distance functions, enabling generalized offset control for flexible positioning in downstream CAE/FEA-oriented workflows. We construct a large-scale mid-surface dataset containing over 1,500 manually annotated CAD models. Experiments demonstrate that MidSurfNet achieves 87.32% face pairing accuracy and successfully handles multi-wall-thickness (61.90% completion) and self-matching (52.94% completion) scenarios that confound all existing methods. Furthermore, MidSurfNet provides a learning-based approach to generalized mid-surface abstraction with arbitrary offset control for CAE-oriented applications.
[LG-38] Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation
链接: https://arxiv.org/abs/2606.01890
作者: Woojun Jung,Susik Yoon
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings overall.
[LG-39] G2LoRA: Gradient Orthogonal Low-Rank Adaptation Framework for Graph Continual Learning on Text-Attributed Graphs KDD2026
链接: https://arxiv.org/abs/2606.01873
作者: Yuhan Wang,Yibo Ding,Yutong Ye,Mufan Zhao,Wenbo Zhang,Ruijie Wang,Jianxin Li
类目: Machine Learning (cs.LG)
*备注: Accepted by KDD 2026
Abstract:LLM-as-Aligner has emerged as a prevalent pre-training paradigm for Text-Attributed Graphs(TAGS), aligning graph and text modalities into a shared embedding space via CLIP-style contrastive learning. While effective on individual downstream tasks, we observe severe catastrophic forgetting when such models are sequentially fine-tuned on streaming tasks. Although parameter-efficient fine-tuning alleviates forgetting to some extent, it remains insufficient to resolve task interference and ineffective knowledge transfer. In this work, we study graph continual learning for LLM-as-Aligner models on TAGs, with the goal of mitigating interference while promoting positive transfer across tasks. This setting introduces two fundamental challenges: (1) heterogeneous downstream tasks induce shifting optimization objectives, hindering unified fine-tuning; and (2) graph and text encoders exhibit different sensitivities to adaptation, making uncoordinated updates prone to misalignment. To address these challenges, we propose G2LoRA, a continual learning framework for TAGs. G2LoRA unifies node-, link-, and graph-level tasks under a single graph–text alignment objective, and enables consistent optimization across domain/class/task incremental modes. To reduce task interference while encouraging positive transfer, G2LoRA performs category-aware gradient projection in structured subspaces, resolving conflicting updates and enabling conditional backward transfer to balance forward and backward knowledge flow. To further prevent cross-modal drift, G2LoRA introduces gradient magnitude modulation to coordinate update rates between graph and text encoders. Extensive experiments on benchmark datasets demonstrate that G2LoRA consistently outperforms strong baselines across different backbone architectures, achieving superior continual performance and transferability.
[LG-40] ask-Induced Representational Invariances Depend on Learning Objective in Deep RL
链接: https://arxiv.org/abs/2606.01868
作者: Manu Srinath Halvagal,Sebastian Lee,SueYeon Chung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection. The ability to learn abstract representations of high-dimensional state spaces underlies much of this success. However, theoretical understanding of these learned representations remains limited, hindering direct comparisons between models and animal learning. We address this gap by analyzing deep RL representations through the lens of MDP reduction theory. Investigating canonical RL algorithms in a navigation task, we find that even when performance is comparable, the value-based method (DQN) learns representations that are invariant to MDP homomorphism symmetries, while the policy-gradient method (PPO) learns representations invariant to action symmetries. These differences emerge consistently across domains, have downstream consequences for transfer learning, and appear in LLMs in a prompt-dependent manner. Our findings provide a principled approach to comparing learned representations across RL algorithms, with demonstrated practical implications and possible insights for neural coding in the brain.
[LG-41] Continual Learning as a Multiphase Moving-Boundary Problem
链接: https://arxiv.org/abs/2606.01863
作者: Snigdha Chandan Khilar
类目: Machine Learning (cs.LG); Mathematical Physics (math-ph)
*备注:
Abstract:Continual learning struggles to balance retaining past knowledge with absorbing new tasks. Stefan-CL elegantly resolves this stability-plasticity dilemma through the physics of melting. It frames consolidated knowledge as a protected “solid” and unused capacity as an adaptable “liquid.” As the network learns, this boundary expands, governed by a “latent heat” tuning dial. By mathematically freezing the learned interior, Stefan-CL cuts forgetting to near zero, matching memory-heavy baselines without storing raw data, forging a beautiful, physics-grounded path for AI.
[LG-42] A Theoretical Framework for Self-Play Theorem Proving Algorithms
链接: https://arxiv.org/abs/2606.01861
作者: Thomas Chen,Zhiyuan Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Self-play, a type of training algorithm that enables a model to self-improve, has recently shown promising empirical results in the context of formal theorem proving using Large Language Models (LLMs). (Dong Ma, 2025) instantiate self-play with two cooperating agents: a prover, which proves theorems, and a conjecturer, which generates new theorems as a curriculum to the prover. In this paper, we provide a theoretical framework for understanding the self-improvement capabilities of self-play algorithms for theorem proving. First, we formalize the set of theorems as a graph, with nodes as theorems and edges between pairs of theorems with similar semantics. We introduce a set of primitive assumptions that characterize the guarantees of a trained prover and how a conjecturer can access the structure of the graph. Second, we show that if the underlying graph of theorems is well-connected, then a prover-conjecturer system, where the conjecturing algorithm is based on a reversible random walk, is sufficient to grow the set of proved theorems exponentially. Third, motivated by an issue encountered empirically by self-play algorithms, where the conjecturer tends to generate artificially complex and non-fundamental theorems, we propose a diversity measure for a training distribution of theorems generated by a conjecturer and an improved conjecturing algorithm that locally maximizes this diversity measure, by computing the diffusion similarity between neighboring theorems in the theorem graph. Finally, we describe a method to compute the diffusion similarity by using contrastive learning to embed nodes into Euclidean space and then computing the inner-product between embeddings.
[LG-43] he Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space ICML2026
链接: https://arxiv.org/abs/2606.01847
作者: Bing-Cheng Chuang,I-Hsuan Chu,Bor-Jiun Lin,YuanFu Yang,Min Sun,Chun-Yi Lee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: ICML 2026 Accepted
Abstract:Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the \textbfEuclidean Fallacy : representing SE(3) poses as flat \mathbbR^12 vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce \textbfLie Diffuser Actor (LDA) , a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC \rightarrow D, LDA improves average task length from 3.27 to 3.51 ( +7.3% ). We further validate our method on real robot and the results show that our methodology outperforms the baseline on majority tasks.
[LG-44] Mos-Gen: A Generative Molecular Framework for Mosquito Insecticide Design
链接: https://arxiv.org/abs/2606.01846
作者: Lina Wang,Yaning Cui
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mosquito-borne infectious diseases cause more than 700000 deaths worldwide each year. The long-term use of conventional chemical insecticides has induced serious resistance problems, creating an urgent need to develop novel, highly effective, and ecologically sustainable alternatives. While existing artificial intelligence approaches in this domain have focused primarily on activity prediction and classification, they leave a critical gap in the de~novo generation of novel molecular scaffolds. In this study, we propose Mos-Gen, a motif-aware generative collaborative framework that couples the pretrained molecular representation model Uni-Mol with a variational autoencoder (VAE), specifically tailored for the design of disulfide-containing allicin derivatives as mosquito insecticides. Among the generated candidates, fourteen compounds – comprising nine predicted positives and five predicted negatives – were selected for chemical synthesis and experimental validation. The hit rate among the predicted positives reached 78%, whereas none of the predicted negatives exhibited mosquitocidal activity. These experimental results fully validated the high-precision screening capability of the Mos-Gen framework.
[LG-45] Observation Not Prediction: Conversation-Level Disaggregated Scheduling for Agent ic Serving
链接: https://arxiv.org/abs/2606.01839
作者: Jianru Ding,Ryien Hosseini,Pouya Mahdi Gholami,Mingyuan Xiang,Henry Hoffmann
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
*备注:
Abstract:LLM-based agents resolve a user task through many turns of dependent inference and tool calls, producing a workload whose total cost is unknown when the task arrives. Existing multi-turn systems keep the turn as the scheduling unit and decide, turn by turn, whether to disaggregate prefill from decode. That decision rests on the turn’s decode length, tool behavior, and KV growth, quantities that are not observable when the scheduler must act, forcing the system to predict them. We show this dependence on prediction is imposed by the scheduling unit, not the workload. Raising the scheduling unit from the turn to the conversation converts turn-level irregularity into a stable, two-phase structure: 1) a compute-bound turn-1 prefill followed by 2) a long, memory-bound tail. Thus, with the conversation as the scheduling unit, placement reduces to reading the first-turn input length and per-decoder KV occupancy, both directly observable. We instantiate this principle in ConServe, which routes the first-turn prefill to a high-throughput prefiller, transfers the KV cache exactly once, and pins the conversation to a single decoder for its entire tail, with no learned model of decode-side cost. Against a per-turn prediction baseline, ConServe reduces p95 time-to-first-effective-token (the latency of a conversation’s first user-visible output) by 51.08% and improves energy efficiency by 7.51% while preserving last-turn TBT and SLOs; mapping the two phases onto heterogeneous GPU tiers adds a further 22.75% in energy efficiency.
[LG-46] ree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits
链接: https://arxiv.org/abs/2606.01799
作者: Pu Wang,Yao-Xiang Ding
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study N -armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within O(N) comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve O(N) sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve O(N) weak regret; (3) enjoy the same O(N \log T) guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with O(N) guarantees for both, eliminating the sub-optimal gap of O(\log N) in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.
[LG-47] A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling
链接: https://arxiv.org/abs/2606.01720
作者: Da Chang,Qiankun Shi,Lvgang Zhang,Yu Li,Ruijie Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor (Y_i(\mathcal C)); in the uniform full-participation full-batch regime, it yields (\widetilde\mathcal O(n^-1+n^-1/2)) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton–Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.
[LG-48] Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging ICML2026
链接: https://arxiv.org/abs/2606.01717
作者: Minsik Choi,Geewook Kim
类目: Machine Learning (cs.LG)
*备注: 32 pages, 5 figures. Accepted for publication at ICML 2026
Abstract:Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture – matching or exceeding centralized joint training with minimal cost overhead – and transfers to text-only FLAN. Our code is available at this https URL.
[LG-49] KDH-CAD: Knowledge-data hybrid CAD learning under data scarcity
链接: https://arxiv.org/abs/2606.01702
作者: Ziqin Gao,Zhijie Yang,Qiang Zou
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: 18 pages
Abstract:Deep learning in computer-aided design (CAD) remains fundamentally constrained by the data scarcity challenge: authentic CAD data is difficult to collect at scale, while synthetic data may not faithfully reflect real design practice. Rather than pursuing ever-larger CAD datasets, this paper alternatively treats CAD learning as a knowledge completion and calibration problem. It introduces KDH-CAD, a knowledge-data hybrid framework that integrates pretrained knowledge in foundation models, structured domain knowledge from textbooks/tutorials, and a very small amount of labeled CAD data. Domain knowledge is used to elicit and complete CAD-relevant concepts that are weakly expressed or under-represented in pretrained foundation models, while labeled CAD data calibrates these concepts in the latent space to account for task-specific geometric variability, without fine-tuning the foundation model. Experiments on real-world mechanical part classification show that KDH-CAD achieves strong performance in low-data regimes, reaching 92.6% accuracy with only 250 training samples, 95.8% with 1,000 samples, and continuing to improve with additional data. This matches or exceeds state-of-the-art performance that typically requires an order of magnitude more data. These results suggest that combining pretrained foundation models with structured domain knowledge can substantially reduce reliance on large-scale CAD datasets, providing a principled and practical direction for data-efficient CAD learning.
[LG-50] CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
链接: https://arxiv.org/abs/2606.01695
作者: Swapnil Parekh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model’s hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen’s d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.
[LG-51] IstGPT : LLM -based Anomaly Detection for Spatial-Temporal Graph in Industrial Systems
链接: https://arxiv.org/abs/2606.01691
作者: Yuchen Zhang,Ning Xi,Pengbin Feng,Shigang Liu,Jianfeng Ma,Yulong Shen,Yanan Sun,Xiaolin Zhou
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Industrial Internet systems face increasing threats from sophisticated industrial control system (ICS) attacks, resulting in critical safety incidents. However, existing tools exhibit limited effectiveness in real-time anomaly detection due to the complex dependencies among sensors and actuators. To tackle this, we present IstGPT, the first industrial anomaly detection tool based on LLMs and graph learning to provide real-time protection against a wide range of ICS attacks. IstGPT achieves fine-grained and precise modeling on spatial-temporal dependencies in industrial cyber-physical systems. It first leverages industrial multi-modal knowledge, including operational data, technical documents, and system diagrams, to extract sensor-actuator dependency graphs via multi-stage prompt engineering. Then, LLM-Optimation iteratively refines the graph based on node accuracy, edge consistency, and logical coherence. Finally, IstGPT integrated improved graph neural networks with an encoder-decoder architecture to detect anomalies via reconstruction errors. We evaluate IstGPT against 12 state-of-the-art baselines on 9 datasets, including 2 public, 6 simulated, and a real-world robotic arm dataset. IstGPT achieves the best F1-scores and eTaF1 (a newer time-aware metric) across nine datasets. We further discuss the feasibility of deploying IstGPT in real-world industrial scenarios.
[LG-52] Dont Let a Few Network Failures Slow the Entire AllReduce
链接: https://arxiv.org/abs/2606.01680
作者: Peiqing Chen,Jiedong Jiang,Nengneng Yu,Yuefeng Wang,Sixian Xiong,Wei Wang,Zaoxing Liu
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注:
Abstract:Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL’s fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.
[LG-53] RDA: Reward Design Agent for Reinforcement Learning
链接: https://arxiv.org/abs/2606.01672
作者: Hojoon Lee,Ajay Subramanian,Ben Abbatematteo,Vijay Veerabadran,Pedro Matias,Karl Ridgeway,Nitin Kamra
类目: Machine Learning (cs.LG)
*备注: Accepted to RLC’26
Abstract:Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, they rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned behavior. As a result, their trained policies achieve the final goal but are frequently poorly aligned with task instructions. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that injects semantic understanding into reward design. RDA decomposes tasks, visually evaluates trajectories, summarizes failure modes, and iteratively revises reward code to better align with task instructions. Across 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA produces policies substantially more instruction-aligned than those of other baselines, while achieving comparable task success rates. Videos and the generated reward code are available on this https URL.
[LG-54] ATLAS: Agent ic Test-time Learning-to-Allocate Scaling
链接: https://arxiv.org/abs/2606.01667
作者: Peijia Qin,Qi Cao,Pengtao Xie
类目: Machine Learning (cs.LG)
*备注:
Abstract:Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator’s direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.
[LG-55] Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim
链接: https://arxiv.org/abs/2606.01665
作者: Bo Li,Chen Zhang
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, 2 tables. Presented at AI-DEEDS 2026 Workshop, ACM Sustainability Week, Banff, Canada (non-archival)
Abstract:We quantify the energy floor – the minimum achievable cost given action space constraints – for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator. Through minimum-action experiments, we directly measure this floor at USD 35.51/day, dominated by continuous electrical loads (USD 35.44, 99.8%) with negligible gas consumption. The standard SAC baseline, initialized with schedule-policy replay buffer transitions, converges to USD 37.18/day, 4.7% above the floor. We identify buffer initialization as the dominant source of sub-optimality in this scenario: training from an empty buffer reduces cost to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible additional savings (USD 0.03/day), and further expansion triggers physical constraint violations. We additionally uncover a discount factor coupling (gamma_eff = 0.891) shrinking the effective planning horizon from 8.3 h to 46 min – a benchmark-wide issue warranting audit. Systematic ablation across planning horizon, reward weights, and observation enrichment confirms all pre-filled-buffer configurations cluster within 0.7% (USD 37.18–USD 37.42), demonstrating that equipment minimum power – not algorithmic design – imposes the binding constraint.
[LG-56] Gate the Filter Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs
链接: https://arxiv.org/abs/2606.01660
作者: Zichao Yue,Zhiru Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Pre-propagation graph neural networks (PPGNNs) push all graph-dependent computation into a preprocessing step and train only on the resulting dense hop features, which makes them highly scalable. A puzzle in this regime is that more complex hop aggregators do not reliably outperform simpler ones: on many benchmarks, a plain MLP-based aggregator matches or beats hop-attention variants. We revisit this behavior from a graph-filter perspective. Over a precomputed diffusion basis, existing PPGNNs differ mainly in how filter coefficients are shared across nodes and feature channels, rather than simply in raw aggregator capacity. MLP-based architectures learn channel-dependent filters that are largely shared across nodes, while hop-attention-based architectures learn node-dependent mixtures that are largely shared across channels. This reveals a missing regime in standard PPGNN designs: joint node- and channel-adaptive filtering under the pre-propagation computational contract. We propose FilterMoE, a mixture-of-experts PPGNN in which a small bank of learnable Chebyshev filter experts is routed jointly over nodes and channels by a 3D gating tensor. Across eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets and ranks first on all three large-scale benchmarks, improving the average test score by 1.53 points. These results establish joint node-channel filter routing as a robust alternative to dataset-specific hop-aggregator selection.
[LG-57] IMWM: Intuition Models Complement World Models for Latent Planning
链接: https://arxiv.org/abs/2606.01626
作者: Baoqi Gao,Ruize Han,Miao Wang,Song Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Planning with a learned latent world model is a promising route to control from raw pixels, but a strong world model alone is not enough. We show this experimentally: even with a perfect world model (operationalized by replacing the learned forward predictor with an idealized rollout of the true environment dynamics), a finite-budget sample-based planner still fails on some tasks, indicating that the bottleneck can lie in search rather than in world-model accuracy. Motivated by this gap, we propose IMWM (Intuition Model + World Model), which pairs the world model with an intuition model trained from demonstrations to recognize promising actions. The two models collaborate through three lightweight components: (i) Retrieval Initialization, which initializes the planner’s action proposal from a retrieved demonstration; (ii) Hybrid Cost, which combines the intuition score with the world-model rollout cost; and (iii) a Reliability Gate, which adjusts how much the planner trusts intuition in each setting. Across four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, and OGBench-Cube), IMWM has higher mean success than the world-model-only planner on all four, with the largest gains on Two-Room (99.2%, +11.5 percentage points) and OGBench-Cube (94.7%, +28.5 percentage points).
[LG-58] Learning Chaotic Dynamics through Second-Order Geometric Supervision
链接: https://arxiv.org/abs/2606.01596
作者: Shinhoo Kang,Hai V. Nguyen,Tan Bui-Thanh
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注: 37 pages, 15 figures, 6 tables
Abstract:Learning chaotic dynamical systems from data requires more than short-term predictive accuracy: the learned model must preserve the attractor geometry and its invariant statistics. Trajectory (zero-order) and Jacobian (first-order) matching supervise the values and tangent structure of the vector field, but neither constrains how the field bends away from its tangent plane. A model can thus match values and tangents at the supervised states yet curve differently from the truth, remaining locally accurate while drifting toward spurious attractors and distorting long-time statistics. We show that enforcing second-order consistency mitigates these failures, but forming the full Hessian is prohibitive in high dimensions. We propose model-constrained randomized Jacobian matching, which compares the Jacobians of the true and learned vector fields at randomly perturbed inputs. A Taylor expansion shows that the expected randomized Jacobian loss decomposes into the nominal Jacobian mismatch plus a Hessian mismatch scaled by the noise variance, implicitly enforcing second-order consistency at \mathcalO(d^2) cost without forming the \mathcalO(d^3) Hessian tensor. Using only Jacobian evaluations, the method scales to high dimensions where explicit Hessian matching does not. Numerical experiments confirm that second-order methods are robust. For Lorenz~63, first-order methods produce catastrophic Lyapunov-exponent outliers under minimal temporal supervision, which second-order methods eliminate while recovering the correct attractor. For coupled Lorenz~96, an out-of-distribution forcing sweep separates the methods: all agree up to F=16 , but beyond F=18 only second-order methods preserve the invariant measure and Lyapunov spectrum. On both systems, randomized Jacobian matching performs comparably to explicit Hessian matching at much lower cost.
[LG-59] Uncertainty-Calibrated Diffusion for Reliable 3D Molecular Graph Generation
链接: https://arxiv.org/abs/2606.01595
作者: Fang Wan,Jingxiang Qu,Yi Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Bayesian inference provides a principled framework for modeling epistemic uncertainty in neural networks by treating predictions as distributions rather than deterministic values. Meanwhile, diffusion-based models for 3D molecular graph generation operate on fragile geometric structures governed by strict chemical constraints, making inference highly sensitive to uncertainty miscalibration. A largely overlooked issue is that epistemic uncertainty arising from the learned denoiser interacts with the aleatoric uncertainty intentionally injected during reverse diffusion, leading to systematic variance inflation and a mismatch between the true distribution and the simulated distribution. This effect is particularly detrimental for high-precision molecular generation, where even small deviations can violate chemical validity. In this work, we provide a theoretical and empirical analysis of how epistemic uncertainty propagates through diffusion inference and degrades sampling quality. Building on this investigation, we propose UCD (Uncertainty-Calibrated Diffusion), a simple yet effective method that calibrates the reverse diffusion process to account for epistemic uncertainty. Extensive experiments on standard 3D molecular benchmarks demonstrate that UCD consistently improves sampling quality across diverse baseline methods, establishing new state-of-the-art performance for 3D molecular diffusion. The code is available at this https URL.
[LG-60] RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning
链接: https://arxiv.org/abs/2606.01566
作者: Amanda S Barnard
类目: Machine Learning (cs.LG)
*备注: 19 pages, 2 figure plates, 8 tables
Abstract:Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where scientific data live, they interact: an unstable selection inflates the variance of an already-optimistic score, and standard remedies for one rarely address the other. RobustModelMaker is a Python framework that couples bootstrap stability selection with strict nested cross-validation, performs all preprocessing and selection inside each fold, and produces a stability-tested feature subset together with a leakage-safe performance estimate. The framework supports nine algorithms across binary classification, multiclass classification, and regression. Behaviour is verified by a deterministic test suite spanning unit, performance, and reproducibility checks on three real scientific datasets comparing to three alternative selectors (ANOVA F-test, recursive feature elimination with cross-validation, and Boruta) on both predictive score and a Jaccard measure of selection stability. RobustModelMaker is competitive in score with the best alternative selector on each dataset, and occupies a position on the joint score-stability frontier that none of the alternatives match across all three task types. Two example applications, ovarian cancer biomarker discovery from the PLCO Trial and critical-temperature regression on the UCI Superconductivity Data, illustrate how the framework is used in practice and what trade-offs become visible when stability is treated as a first-class deliverable rather than an emergent property.
[LG-61] MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference
链接: https://arxiv.org/abs/2606.01563
作者: Yu Li,Binxu Li,Tian Lan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We identify that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens in practice are often near-orthogonal to the retained ones. Thus, even a small evicted mass could have an oversized impact on the resulting direction distribution and amplify into substantial output error. This reveals a fundamental limit in existing strategies. To address this, we propose MomentKV, which maintains compact, small-size moment statistics over the evicted token set, including a count, key mean, value mean, and value-key covariance. During eviction, the moment statistics is leveraged to identify tokens already well aligned with and captured by the accumulated summary, keeping the evicted set geometrically regular. During inference, they yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.
[LG-62] Everywhere Learning: Artificial Intelligence with Pointwise Constraints
链接: https://arxiv.org/abs/2606.01557
作者: Ignacio Boero,Ignacio Hounie,Luiz Chamon,Alejandro Ribeiro
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Everywhere learning is a new paradigm whereby Artificial Intelligence (AI) systems are trained to satisfy loss constraints with probability one over the data distribution. This is in contrast to the standard paradigm of training AI systems to minimize average losses. We develop an approximate duality theory to substantiate a generalization analysis that establishes the proximity between solutions of empirical and statistical everywhere learning problems. Our results show that dual variables reweigh the data distribution towards points in which loss constraints are more difficult to satisfy and that generalization is controlled by the mismatch between the concentration of mass of the data distribution and the concentration of mass on points where constraints are more difficult to satisfy. We further show that we can control generalization with a sparse L1 penalty on constraint relaxations. We illustrate the merits of everywhere learning with an experiment in agentic classification for language model tasks.
[LG-63] Flexible Online Representation Learning Based on Similarity Matching IJCNN2023
链接: https://arxiv.org/abs/2606.01546
作者: Shagesh Sridharan,Yanis Bahroun,Anirvan M. Sengupta
类目: Machine Learning (cs.LG)
*备注: 6 pages, 3 figures. Originally accepted to IJCNN 2023 but not presented owing to visa issues
Abstract:Sparse high-dimensional representations are conducive to uncovering nontrivial structures in unsupervised exploration of data. Such a representation can deal with the dense connectivity in graphs relevant to community detection problems. However, sparse high-dimensional representations are capable of doing more, including manifold tiling and feature learning. Conventional algorithms optimize in the space of computationally intractable completely positive matrices or relax the problem to the space of doubly nonnegative matrices that scale with sample size in a way rendering them impractical for large data sets. Some of these methods also impose a row sum constraint, such as double stochasticity. Row sum constraints have the added advantage of being shift-invariant, in the context of manifold tiling. Constraints on the row sum of output similarity matrices require nontrivial online learning rules. Addressing these needs, we propose a versatile online biologically plausible learning algorithm capable of learning sparse shift-invariant representations, useful for clustering, manifold tiling, or sparse coding, depending on the data structure.
[LG-64] CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search
链接: https://arxiv.org/abs/2606.01544
作者: Cheonjun Park
类目: Machine Learning (cs.LG)
*备注: 10 pages
Abstract:Deploying Large Language Models (LLMs) in practice incurs substantial memory and computational costs. Post-training pruning (PTP) is an effective approach to reducing these costs by removing weights without additional training. Among existing methods, RIA introduces relative importance scores normalized by row and column sums, achieving state-of-the-art accuracy. However, RIA considers only 1D cross-shaped (row/column) directional information and assigns equal weight to row and column contributions. In this paper, we propose \textbfCRePE, which incorporates 2D local neighborhood context and adaptive coefficients into Relative Importance scoring. CRePE consistently outperforms existing PTP methods across diverse models and sparsity settings. However, identifying optimal adaptive coefficients via perplexity (PPL)-based hill climbing requires numerous PPL evaluations and approximately 11 hours of search time. To address this, we propose \textbfPHO (Proxy-based Hyperparameter Optimization), which eliminates the need for repeated PPL measurements and reduces the search time to approximately 20 minutes. Furthermore, the optimal hyperparameter configuration found by PHO on one model transfers well to other models, demonstrating strong generalization. Finally, we verify that CRePE can be orthogonally combined with existing techniques including Channel Permutation, non-uniform sparsity allocation, and re-pruning methods.
[LG-65] Rethinking the Role of Positional Encoding: Sliding-Window Transformers without PE Remain Turing Complete
链接: https://arxiv.org/abs/2606.01532
作者: Qian Li,Xinyu Mao,Shang-Hua Teng
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC)
*备注:
Abstract:Positional encoding (PE) is widely viewed as necessary for transformers to process ordered sequences: without them, the next-token map appears permutation-invariant in its context tokens. This intuition underlies all prior universality results, which rely on positional information to prove that transformers with chain-of-thought can perform arbitrary computation, i.e., they are Turing complete. We revisit this belief in the regime most relevant to long-form reasoning, where generation proceeds through a finite sliding context window. Our opening perception is that the window mechanism itself (mildly) breaks the permutation symmetry. To distill and precisely capture the degree of this added expressiveness, we introduce an abstract autoregressive model, the HIST model, in which each update depends only on constant-size internal state and the token-count histogram within the current window. We prove that this HIST model is Turing complete by showing that the evolution of the window can reveal the token that has just left the window, which suffices to simulate Turing-complete Post machines. We then construct a sliding-window transformer over a constant-size token alphabet, without PE, and show that it can simulate the HIST model. Our result demonstrates that positional encodings are not indispensable for transformers to perform universal computation: The window sliding itself already breaks permutation symmetry and captures sufficient positional information.
[LG-66] Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses
链接: https://arxiv.org/abs/2606.01527
作者: Matthew Regehr,Gautam Kamath,Andrew Lowy
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注:
Abstract:Machine unlearning is motivated by legal and user-facing requirements to remove the influence of individuals’ data from trained models, such as the right to be forgotten. Prior work has developed algorithms and error bounds for unlearning in smooth strongly convex stochastic optimization, but the fundamental statistical cost of unlearning has remained unclear. We nearly resolve this problem by proving upper and lower bounds on the excess population risk of approximate \varepsilon -unlearning; our bounds are tight up to a condition-number factor. For mean estimation over the unit ball, our upper and lower bounds match. The optimal rate is the usual statistical error plus an unlearning penalty that interpolates between the retraining-from-scratch rate and an exponentially smaller term as \varepsilon/d grows, where d is the dimension of the model. In particular, when \varepsilon \gg d , our \varepsilon -unlearning algorithm offers an exponential accuracy improvement over retraining the model from scratch and differentially private baselines. On the other hand, when \varepsilon \le d , retraining from scratch is optimal.
[LG-67] Semi-Supervised Hyperbolic Hierarchical Clustering with Set-Level Structural Priors
链接: https://arxiv.org/abs/2606.01525
作者: Junjing Zheng,Xinyu Zhang,Xiangfeng Qiu,Chengliang Song,Weidong Jiang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Semi-supervised hierarchical clustering aims to learn a tree structure consistent with data patterns and user-provided supervision. Supervision is usually given as leaf-level relations, such as pairwise must-link/cannot-link constraints or triplet-wise must-link-before constraints. Although useful for regulating local sample relations, such supervision does not directly indicate which samples should form coherent subtrees. Consequently, the non-leaf structure of the learned tree may deviate from the hierarchical organization preferred by ground-truth labels. To address this limitation, we propose a semi-supervised hyperbolic hierarchical clustering method with set-level structural priors. The main contribution is to introduce sets as basic modeling units for hierarchy learning. Each set denotes samples expected to cohere within a subtree and is induced from leaf-level supervision together with a learned constraint-consistent similarity structure. These sets act as soft structural priors for subtree-level supervision, allowing supervision to guide non-leaf hierarchy formation beyond local leaf-level relations. Specifically, we first learn constraint-consistent embeddings to obtain a reliable set partition, then construct constraint-induced sets and estimate inter-set similarities to form set-level structural priors. Finally, these priors are incorporated into a hyperbolic hierarchy objective for continuous tree optimization. Experiments on eleven benchmark datasets and ablation studies show that the proposed method consistently improves label consistency over representative hierarchical clustering baselines while also enhancing similarity-based tree quality.
[LG-68] Fast Generalization after Interpolation via Critically Damped Momentum Optimization
链接: https://arxiv.org/abs/2606.01521
作者: Luca Muscarnera,Silas Ruhrberg Estévez,Yuanzhang Xiao,Mihaela Van der Schaar
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and optimization must implicitly select among minima with different generalization properties. Following recent theoretical advances on optimization dynamics near the interpolation threshold, we note that the two-regime structure of risk minimization, with loss minimization followed by complexity minimization, motivates a biphasic optimization schedule. We thus theoretically demonstrate that GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM)-based post-interpolation norm minimization, offers a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers. To showcase the applicability of our method, we evaluate GROKtimizer on several synthetic benchmarks common in the classical grokking literature and on various real-world datasets. Finally, we reconcile our findings with the flat-minima hypothesis, highlighting the importance of post-interpolation dynamics in the construction of high-quality, generalizing models.
[LG-69] Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks
链接: https://arxiv.org/abs/2606.01432
作者: Parastoo Farajpoor,Alireza Pourreza,Mohammadreza Narimani,Ashraf El-Kereamy,Matthew W. Fidelibus
类目: Machine Learning (cs.LG); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Machine Learning (stat.ML)
*备注: 8 pages, 5 figures. Author-accepted version of the SPIE conference paper
Abstract:Accurate modeling of leaf spectral reflectance from physiological and biochemical traits is essential for advancing remote sensing applications in plant science and precision agriculture. Widely used radiative transfer models, such as PROSPECT-PRO, rely on generalized trait-reflectance relationships developed from a wide range of species, which may not fully capture the spectral behavior of specific crops like grapevines. In this study, we developed a trait-to-spectra prediction model using a multi-head attention neural network trained on a grapevine-specific dataset that includes 16 leaf traits measured across multiple varieties, growth stages, and years. The model was evaluated using stratified 5-fold cross-validation and achieved an average coefficient of determination (R^2) of 0.84 and normalized root mean squared error (NRMSE) of 1.52 percent, demonstrating high accuracy and generalizability. When compared to PROSPECT-PRO in forward mode, the neural network exhibited lower mean absolute error (MAE), especially in the near-infrared (NIR) and shortwave-infrared (SWIR) regions. These results emphasize the importance of species-specific modeling approaches and show that integrating biochemical and structural traits into data-driven architectures can significantly improve spectral prediction. The proposed model provides a robust framework for generating accurate leaf-level reflectance data, with potential applications in canopy trait retrieval, vineyard monitoring, and remote sensing-driven crop management.
[LG-70] Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization
链接: https://arxiv.org/abs/2606.01425
作者: Gishnu Madhu,Feng Liu,Souma Chowdhury
类目: Machine Learning (cs.LG)
*备注: Accepted for presentation at 2026, ASME IDETC
Abstract:Mixed-combinatorial nonlinear programming (MCNLP) problems arise in many engineering design and planning applications, e.g., due to categorical, component, and geometric design choices, as well as joint task and motion planning. Traditional representations of combinatorial spaces, such as integer or binary encoding, often introduce spurious relations, increase dimensionality, and require additional compatibility constraints. Instead, this paper draws on recent developments in robot planning and vehicle/network routing domains that aim to learn search heuristics over combinatorial spaces using graph neural networks (GNNs). More specifically, this paper presents a first-of-its-kind structured abstraction of the combinatorial space by learning a mapping from an undirected fully connected graph of combinations to a directed graph indicating improvement directions using an Edge Field Graph Network (EFGN). To demonstrate the utility of this new way of abstracting the combinatorial space in solving MCNLPs, we adopt a recent optimization framework that purely searches over the non-combinatorial (e.g., continuous) variables and retrieves the best-suited combination for each candidate design by using the abstraction model, akin to a recommender system. The presented direction-aware abstraction model provides a potentially more scalable and interpretable retrieval of combinations compared to the original recommendation system in that framework. For evaluation, the proposed method is integrated with a well-known particle swarm optimization and genetic algorithm solvers on three benchmark nonlinear problems with varying numbers of combinations and variables. Compared to baseline solvers using indexified combinations, the GNN-based recommender consistently achieves better mean optimum values and robustness across multiple runs.
[LG-71] arget localization identification and sensing using latent symmetries
链接: https://arxiv.org/abs/2606.01421
作者: David Dukov,Malte Röntgen,Bryn Davies
类目: Machine Learning (cs.LG)
*备注: Submitted to SIAM Journal on Imaging Sciences
Abstract:We show that an array of scatterers which has been designed to have latent (“hidden”) symmetries can be used as a sensor. We use the capacitance matrix as a canonical model for three-dimensional hybridisation and study how the introduction of an "intruder’’ scatterer breaks the latent symmetries. By analysing the degree to which each symmetry is broken, we identify the radius of the intruder and localize its position. This can be achieved using a dictionary-based approach, however Bayesian inference or an artificial neural network (multi-layer perceptron) perform better in the presence of measurement noise. To our knowledge, this is the first time latent symmetries have been exploited successfully for sensing problems. It is also the first time latent symmetries have been observed in a three-dimensional open system that cannot be approximated by a sparse graph.
[LG-72] GPT Q-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation
链接: https://arxiv.org/abs/2606.01412
作者: Shihao Zhang,Rayan Saab
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注:
Abstract:Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form W\approx Q+LR . In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective |XW-X(Q+LR)|_F^2 , where X is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice L=V_r , where V_r contains the top right singular vectors of X , we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on |X|_F^2 is replaced by the rank- r residual |X-X_r|_F^2 , up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.
[LG-73] Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision
链接: https://arxiv.org/abs/2606.01397
作者: Mehmet Iscan,Batuhan Temiz
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: 47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark, source code and a demonstration video are linked in the paper
Abstract:A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.
[LG-74] urning Back Without Forgetting: Selective Backward Refinement for Parameter-Efficient Continual Learning
链接: https://arxiv.org/abs/2606.01379
作者: Anushka Tiwari,Kaiyi Ji
类目: Machine Learning (cs.LG)
*备注:
Abstract:While prompt-based parameter-efficient continual learning mitigates catastrophic forgetting by isolating task-specific prompts, this isolation also limits later tasks from improving earlier ones, leaving backward knowledge transfer underexplored. We address this limitation by proposing Selective bAckward refinement for positive Backward knowledge transfER (SABER), a replay-free framework that enables controlled backward transfer in prompt-based continual learning. SABER determines when backward refinement is beneficial using complementary task-correlation criteria based on prompt-gradient geometry and loss-distribution similarity, and how to perform refinement safely by restricting updates to non-interfering directions in the prompt parameter space. Extensive experiments across multiple continual learning benchmarks and diverse pretrained backbones, including T5-Large, LLaMA, and Qwen, demonstrate that SABER consistently achieves positive backward transfer while maintaining strong overall average performance. Code is available at this https URL.
[LG-75] From Performance to Viability: A Bootstrap Framework for Latent-Space Representation Learning in Adaptive Biological Systems
链接: https://arxiv.org/abs/2606.01374
作者: Jacques Raynal,Pierre Slangen,Elsa Raynal,Jacques Margerit
类目: Machine Learning (cs.LG)
*备注: 25 pages. Methodological framework for latent-space representation learning in adaptive biological systems
Abstract:Observable performance is commonly used to characterize biological systems. In adaptive systems, however, similar performances may arise from distinct organizations, and configurations that appear comparable at a given time may follow different longitudinal trajectories. This limitation motivates a methodological framework for moving beyond performance-based interpretation without assuming a complete mechanistic model in advance. This article proposes a bootstrap framework for latent-space representation learning in adaptive biological systems. Here, bootstrap is used in a methodological and epistemological sense: new analytical levels are introduced when the preceding representation becomes insufficient to account for observed adaptive dynamics. The framework is organized around five levels: observable performance, dynamic organization, latent organization, longitudinal viability, and internal predictive approximation. The framework is illustrated by three previously reported gait–occlusion studies, used here only as a methodological case sequence and not as new experimental evidence. The article formalizes how performance analysis led to latent organization, how static latent organization led to longitudinal viability, and how observed viability led to internal predictive approximation. The contribution is not a new learning algorithm, clinical protocol, or dataset, but a bootstrap framework for latent-space representation learning describing how increasingly informative representations can emerge from observational insufficiencies in adaptive biological data.
[LG-76] All Models are Wrong Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning
链接: https://arxiv.org/abs/2606.01363
作者: Bernd Frauenknecht,Devdutt Subhasish,Artur Eisele,Friedrich Solowjow,Sebastian Trimpe
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and bears the potential to address open problems such as data efficient and safe learning in robotics. However, inaccuracies of the learned dynamics model are typically exploited by the agent, substantially hampering the capabilities of MBRL methods. We present a framework for dealing with inaccuracies of probabilistic models through targeted handling of uncertainty that effectively mitigates model exploitation. We present recent successes in learning directly on hardware and safe exploration, and discuss future directions for uncertainty-aware MBRL.
[LG-77] owards Optimal Robustness in Learning-Augmented Paging ICML2026
链接: https://arxiv.org/abs/2606.01342
作者: Peng Chen,Hailiang Zhao,Xueyan Tang,Yixuan Wang,Shuiguang Deng
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Learning-augmented paging has been extensively studied in recent years. A key advantage over naive ML-based approaches is \emphbounded robustness, which guarantees worst-case performance even when predictions are inaccurate, making these algorithms valuable for real-world systems. Prior work achieves robustness bounds of 2H_k + O(1) in the randomized setting, leaving a gap to the optimal competitive ratio H_k . In this paper, we study how to close this gap. We begin by reviewing online optimality and proving a new property of the latest H_k -competitive algorithm, which facilitates our analysis in the learning-augmented setting. Then, we review existing learning-augmented paging algorithms and introduce a unifying primitive, the \emphrelative prediction budget, which captures the essence of establishing robustness and reveals that prior algorithms either overuse or underutilize predictions. Guided by the above analysis, we develop a new framework that achieves the best-possible robustness up to an additive constant for learning-augmented paging: H_k + O(1) . Experiments further demonstrate strong practical performance. Comments: ICML 2026 Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG) Cite as: arXiv:2606.01342 [cs.DS] (or arXiv:2606.01342v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2606.01342 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-78] Sample Complexity and Decision-Theoretic Guarantees for Bayesian Model Averag ing over Decision Trees with Catalan-Exponential Priors
链接: https://arxiv.org/abs/2606.01340
作者: Livija Jakaite,Vitaly Schetinin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 21 pages, 3 figures, Submitted to the Journal of Machine Learning Research
Abstract:We ask: when do Bayesian model averaging (BMA) weights over decision trees carry sufficient epistemic information to justify committed exploitation of the averaging distribution? We answer this question in closed form for Bayesian decision trees (BDTs) with Dirichlet-Multinomial leaf models and a Catalan-exponential tree-size prior (SchetininJakaite, 2025), establishing a complete non-asymptotic theory of rational commitment thresholds.
[LG-79] Conditioned free-energy density of proteins using unbalanced solutions to constraint satisfaction problems
链接: https://arxiv.org/abs/2606.01329
作者: Pratik Worah,Subhash Khot,Srinivasa Varadhan
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:We show that computing the log-partition function (free-energy) of conditioned inhomogeneous Curie–Weiss spin Hamiltonians reduces to an unbalanced 2 \to 1 norm computation, and design a polynomial-time SDP algorithm for this problem with a lower bound proof for the amount of unbalance achieved. Applied to the protein Ubiquitin, the framework starts from a known crystal structure, explores alternative backbone conformations across the free-energy landscape, and identifies flexible regions of the protein while preserving its native secondary structure.
[LG-80] SEArch: Optimistic Policy Selection Between Scene Noise and Drift for UAV Radar Search
链接: https://arxiv.org/abs/2606.01325
作者: Noor Khial,Naram Mhaisen,Loay Ismail,Amr Mohamed
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:Unmanned Aerial Vehicles (UAVs) equipped with radar sensors are deployed for target search missions in diverse environments, where targets exhibit characteristic signatures (e.g., respiration micro-motion in human search) detectable through occlusions. A fundamental challenge arises from shifts in radar statistics as the UAV moves through a dynamic and potentially non-stationary environment, rendering any fixed signal-processing strategy suboptimal; yet perception and adaptation must run onboard a resource-constrained aerial node in real time. Since no single detector performs well across all conditions, we adopt a multi-policy paradigm and formulate UAV target search as an online policy selection problem over a library of specialized detectors, with performance measured by regret, the cumulative loss gap relative to the best policy in each scene. The setting couples in-scene stochastic noise with inter-scene shifts. Whereas prior methods capture only one regime, we account for both through the Stochastically Extended Adversary (SEA) framework, without requiring oracle knowledge of scene dynamics. Because adaptation must run at the UAV, we instantiate SEA through \textscSEArch, a lightweight optimistic Follow the Regularized Leader (OFTRL) selector with an adaptive learning rate, achieving regret O(\bar\sigma_T \sqrtT + \sqrtJ) , where \bar\sigma_T captures radar measurement noise and J is the number of scene transitions over the mission horizon T . To enable rapid adaptation under frequent scene changes, we further introduce \textscW-SEArch, a windowed variant that restarts every w rounds and achieves regret O(\bar\sigma_I \sqrtw) under at most one transition per window. Experiments show up to 30% regret reduction compared to non-adaptive baselines across a range of non-stationary settings.
[LG-81] When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval KDD2026
链接: https://arxiv.org/abs/2606.01304
作者: Zhicheng Zhang,Jiwei Tang,Kuicai Dong,Xiaopeng Li,Jieming Zhu,Jingyu Li,Qianhui Zhu,Fengyuan Lu,Wang Jiaheng,Gang Wang,Hai-Tao Zheng,Zhaocheng Du
类目: Machine Learning (cs.LG)
*备注: Accepted at KDD 2026
Abstract:Hard negative mining has become the dominant strategy for training retrievers, yet it faces intrinsic limitations: negatives are bounded by corpus availability, selected by retriever score rather than diagnostic value, and increasingly contaminated by false positives as the retriever improves. LLM-based synthesis offers a principled alternative, where negatives that are unconstrained, targeted, and free from false positive risk. But we show that naively incorporating generated negatives into contrastive learning often degrades retrieval performance. We identify and formalize the root cause as a generative-discriminative gap: LLM generation optimizes for fluent, plausible text, while contrastive learning demands strategic violations of relevance at the decision boundary. Our analysis reveals two compounding failure modes: discriminative-agnostic generation, where the LLM lacks an explicit model of query information needs and defaults to generic or topic-drifted text that provides no contrastive signal; and source-dependent shortcuts, where distributional artifacts enable the model to distinguish negatives by origin rather than relevance, causing gradient drift that actively corrupts optimization. To close this gap, we propose CausalNeg consisting of two main modules: (1) CoT-guided counterfactual perturbation for data construction: decomposes why a document satisfies a query into explicit information requirements, then surgically violates individual requirements to construct negatives with controlled, interpretable hardness. (2) Query-view entropy maximization during training: disperses generated negatives across the similarity spectrum, minimizing the mutual information between source identity and similarity scores to suppress shortcut exploitation. We make our code publicly available at this https URL.
[LG-82] Structure and Scale in Simplicial Sequence Modelling
链接: https://arxiv.org/abs/2606.01302
作者: Matthew Farrugia-Roberts
类目: Machine Learning (cs.LG)
*备注: HiLD 2026: 4th Workshop on High-dimensional Learning Dynamics
Abstract:Modern large-scale deep learning exhibits two striking empirical phenomena: behavioural scaling laws (predictable performance gains with increasing scale) and emergent mechanisms (structured internal representations and circuits in deep neural networks). We hypothesise that these two phenomena are connected: that predictable changes in behaviour are the result of predictable changes in internal computational structure. In this paper, we report preliminary evidence of such a connection. We find a correlation between scaling patterns in performance and representations in small transformers trained to predict the outputs of a hidden Markov model, for which residual activations are known to linearly encode a belief distribution over latent states in a probability simplex.
[LG-83] Feature to Dynamics: Feature-space to Autoregression strategy for Zero-shot Time Series Forecasting
链接: https://arxiv.org/abs/2606.01289
作者: Yifan Wu,Junjie Wu,Kai Wu,Xiaoyu Zhang,Jian Lou
类目: Machine Learning (cs.LG)
*备注:
Abstract:Zero-shot time series forecasting aims to predict future values for previously unseen series, requiring models to generalize temporal dynamics beyond the training distribution. While recent foundation models achieve strong in-domain performance through large-scale pretraining, their effectiveness often relies on broad data coverage and implicit pattern memorization, which can limit generalization when data are scarce or source and target domains are disjoint. In this work, we propose FSA, a feature-to-strategy framework for controlled zero-shot univariate forecasting. Instead of directly modeling raw sequences in the observation space, FSA learns a structured mapping from an interpretable feature space to an autoregressive strategy space. This design introduces explicit inductive biases that disentangle global trends, periodic components, and local temporal dynamics, enabling the model to capture transferable time-series structure with fewer data assumptions. Empirical results show that, under identical pretraining data, training protocol, and comparable parameter budgets, FSA outperforms Transformer-based architectures in our controlled zero-shot setting.
[LG-84] AdaKernel: Learning Adaptive Kernel Parameters for Spatiotemporal Graph Neural Networks
链接: https://arxiv.org/abs/2606.01283
作者: Zhongyue Zhang,Guangyin Jin,Yuxuan Liang,Suwan Yin,Yuankai Wu
类目: Machine Learning (cs.LG)
*备注: 17 pages, 15 figures, including appendix
Abstract:Modeling spatial dependencies is central to spatiotemporal data analysis using Graph Neural Networks (GNNs). Traditional methods rely on distance-based kernels with predefined parameters, which restricts model capacity. Although generic adaptive mechanisms (e.g., Graph Attention Networks) offer flexibility, they often fail to capture the underlying geometric structure, performing worse than distance-based models in data-sparse scenarios. Addressing this, we revisit the kernel parameterization problem and theoretically prove that misspecified kernel parameters introduce unavoidable approximation errors in GNNs. To overcome this, we propose AdaKernel, a simple yet effective approach that learns adaptive kernel parameters within the neural network. Unlike methods that learn graph structures from scratch, AdaKernel adopts a structure-preserving strategy that optimizes the scale of physical interactions rather than discarding them. Extensive experiments on Kriging, Imputation, and Forecasting demonstrate that AdaKernel consistently improves various GNN architectures and outperforms model-agnostic adaptive baselines, validating that accurately learned kernel parameters are superior to both fixed priors and fully latent graph structures.
[LG-85] GLIDE: Graph-guided Leap Inference for Diffusion Estimation of Spatio-Temporal Point Processes
链接: https://arxiv.org/abs/2606.01273
作者: Guanyu Zhou,Yao Liu,Yanglei Gan,Yuxiang Cai,Peng He,Run Lin,Yuxiang Liu,Qiao Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Spatio-temporal point processes (STPPs) provide a principled framework for modeling asynchronous events in continuous time and space. Recent diffusion-based approaches offer a flexible alternative to deterministic prediction by modeling complex conditional distributions, but their application to STPPs remains challenging: reverse sampling from pure noise is costly, and weak structural constraints in sparse spatial domains can lead to poorly localized probability mass. We propose \textbfGLIDE (Graph-guided Leap Inference for Diffusion Estimation), a conditional diffusion framework for next-event modeling in STPPs. GLIDE organizes historical events into a multi-scale historical graph and encodes temporal evolution and spatial topology through a dual-stream architecture, yielding a structured conditioning context for a dual-branch diffusion denoiser. It further introduces a prior-guided leap inference mechanism, in which a lightweight mean predictor provides a deterministic anchor and the reverse process starts from an intermediate diffusion step instead of from pure Gaussian noise. Experiments on multiple real-world datasets show that GLIDE improves both distribution fitting and next-event prediction, with the largest gains appearing on the spatial side. The results also indicate that prior-guided leap inference substantially reduces reverse-sampling cost while preserving the stochastic generation capability of diffusion models.
[LG-86] raining-Free Imitation Learning with Closed-Form Diffusion Policies
链接: https://arxiv.org/abs/2606.01238
作者: Raghav Mishra,Ian R. Manchester
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies, a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP with real-time inference with a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.
[LG-87] DAGGER: Gradient-Free Construction of Transiently Amplifying Networks under Hard Connectivity Constraints
链接: https://arxiv.org/abs/2606.01227
作者: James C. Ferguson
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注: 12 pages, 7 figures
Abstract:Many networks not only support but also rely on transient non-normal amplification, an orders-of-magnitude increase in the activity of an otherwise stable system. Constructing such networks under hard sign/sparsity/diagonal constraints – the regime relevant for biological connectomes and structured RNN initializations – has so far required either gradient-based local search with thousands of inner-loop eigendecompositions or Schur-form direct construction in an abstract basis that breaks the constraints under projection. Here we introduce DAGGER (Directed Acyclic Graph Guided Edge Reweighting), a gradient-free single-pass algorithm. Given a stable signed sparse matrix, DAGGER produces an output with the same sign, sparsity, and diagonal. A single scalar \beta controls a Wasserstein-2 budget that smoothly trades exact multiset preservation ( \beta = 0 ) for amplification; peak amplification grows essentially without bound with \beta , empirically reaching 10^10 before numerical overflow. DAGGER matches or exceeds gradient-based methods at multiset preservation in a single forward pass – 30-100 \times fewer eigendecompositions than a typical gradient inner loop – and at moderate \beta beats them by orders of magnitude with connectivity exactly preserved. We develop the algorithm, compare it to the existing methods and on a downstream signal-detection task, and examine the diagnostics that show why DAGGER is structurally different from other amplifying networks. Comments: 12 pages, 7 figures Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC) Cite as: arXiv:2606.01227 [cs.LG] (or arXiv:2606.01227v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.01227 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-88] Riemannian Optimization for Hadamard Products of Low-Rank Matrices
链接: https://arxiv.org/abs/2606.01216
作者: Pratik Jawanpuria,Ankish Chandresh,Bamdev Mishra
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:The elementwise Hadamard product of two low-rank matrices provides a parameter-efficient model for data with multiplicative structure, but its modeling is challenging due to the presence of additional symmetries under coupled row/column scalings between the two factors. In order to leverage the geometry of the space, we formulate the learning of such matrices as optimization on a Riemannian quotient manifold. We propose a novel block-diagonal Riemannian metric derived from the pullback of the Frobenius inner product. The metric is shown to be invariant under the full symmetry group. We develop a Riemannian gradient descent algorithm that uses a tuning-free Gauss–Newton step size and scales linearly in the number of observed entries per iteration. Experiments on real and synthetic datasets illustrate the efficacy of our proposed Riemannian approach.
[LG-89] Linear Strategic Classification with Endogenous Improvements
链接: https://arxiv.org/abs/2606.01198
作者: Siddharth Shrivastava,Mahvith Akshintala,B Vamsha Vardhan Reddy,Naresh Manwani,Sujit Gujar,Ganesh Ghalme
类目: Machine Learning (cs.LG)
*备注:
Abstract:Strategic classification studies settings in which agents respond to a deployed classifier by modifying observable features at a cost. Classical models typically treat such responses as cosmetic: features may change, but true labels remain fixed. We study an improvement-aware variant in which strategic responses can induce genuine changes in outcome-relevant features. Agents choose post-deployment feature vectors strategically, and labels are then generated according to a stable conditional outcome law that preserves the relationship between features and outcomes. We formalize this problem for linear classifiers under a single-index qualification model and linear-decomposable costs. We show that the strategic-optimal classifier is obtained by a parallel shift of the Bayes-optimal decision boundary, and that it provides a better surrogate for the improvement-aware objective than the Bayes classifier. Since improvement-aware learning requires post-deployment labels, which are typically unavailable before deployment, we provide PAC-style guar- antees under an oracle model, propose a practical plug-in algorithm, establish its generalization bound, and evaluate it on synthetic and real-world datasets.
[LG-90] mporal Motif Signatures for Temporal Graph Neural Networks
链接: https://arxiv.org/abs/2606.01176
作者: Dylan Sandfelder,Mihai Cucuringu,Xiaowen Dong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real temporal interaction streams carry predictive structure in short-horizon motif patterns – repetition, reciprocity, star diversity, triadic flow – that vanilla temporal graph neural networks (TGNNs) often fail to expose to their edge scorers. We show this concretely on MOOC interaction prediction, where a small four-feature family of past-window star counts already delivers most of the lift over a strong static GNN. Across a wide set of real and synthetic temporal datasets we find that motif activity organizes consistently along three scale-stable axes (dyadic recency/reciprocity, star diversity, triadic flow), and we use this empirical structure to design a compact 13-coordinate, leakage-safe, candidate-local motif feature map h(u, v, t) that linearly embeds into any static or temporal encoder without architectural changes. A temporal Weisfeiler-Leman (WL) analysis places the augmentation relative to the first level of an anchored temporal-WL hierarchy and exhibits a candidate-anchored pair on which motif features distinguish. We demonstrate empirically that the same augmentation consistently lifts performance across heterogeneous tasks: TGB link-property prediction across all five baselines, edge classification on Bitcoin Alpha/OTC and MOOC, and graph-level classification of synthetic temporal generators.
[LG-91] Revisiting Neural Processes via Fourier Transform and Volterra Series
链接: https://arxiv.org/abs/2606.01172
作者: Peiman Mohseni,Nick Duffield,Raymond K. W. Wong
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions – especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods’ efficacy against state-of-the-art baselines.
[LG-92] Fairness in two-player zero-sum games with bandit feedback
链接: https://arxiv.org/abs/2606.01159
作者: S Akash,Pratik Gajane
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least \alpha/m . Existing instance-dependent results target \textitpure Nash equilibria, while fairness generically produces \textitmixed equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as p = (\alpha/m)\mathbf1 + (1-\alpha)\widetildep with \widetildep \in \Delta_m , and substituting into the payoff form yields p^\topAq = \widetildep^\top\widetildeA q for a fair payoff matrix \widetildeA := (1-\alpha)A + \alpha\mathbf1 c^\top , where c_j = \tfrac1m\sum_i A(i,j) is the column-mean vector. The fair game on A is then equivalent to a standard zero-sum game on \widetildeA , so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to \widetildeA . We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most \alpha(1-1/m) and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an \widetildeO(T^2/3) regret bound for an Explore-Then-Commit algorithm, \textttFair-ETC-TPZSG , applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when \widetildep^\star is a vertex of \Delta_m , the bound sharpens to instance-dependent \widetildeO(1/\widetilde\Delta(\alpha)^2) , where \widetilde\Delta(\alpha) is the LP-margin gap.
[LG-93] Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies ICML2026
链接: https://arxiv.org/abs/2606.01151
作者: Hikmet Simsir,Ozgur S. Oguz
类目: Machine Learning (cs.LG)
*备注: Accepted as a regular paper at ICML 2026
Abstract:Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: this https URL.
[LG-94] Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning
链接: https://arxiv.org/abs/2606.01128
作者: Tehila Dahan,Bassel Hamoud,Roie Reshef,Martin Jaggi,Kfir Y. Levy
类目: Machine Learning (cs.LG)
*备注:
Abstract:Communication overhead is a crucial bottleneck in scalable distributed learning. While existing methods aim to efficiently utilize data points, such as Local SGD, Minibatch SGD, and their accelerated variants, they still exhibit communication-round complexity that scales with the total number of samples N . In this paper, we introduce Local MixVR, a distributed framework that integrates local updates with variance-reduction techniques to mitigate local noise. We show that Local MixVR is the first distributed method to eliminate the dependence of communication complexity on N , achieving a complexity that scales only with the number of workers M . In common regimes where MO\left(N^1/4\right) , Local MixVR outperforms the state-of-the-art Minibatch Accelerated SGD baseline, bridging a long-standing gap in distributed optimization and establishing a new paradigm for communication-efficient training.
[LG-95] From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning ICML2026
链接: https://arxiv.org/abs/2606.01123
作者: Jun-Jie Yang,Chia-Heng Hsu,Kui-Yuan Chen,Ping-Chun Hsieh
类目: Machine Learning (cs.LG)
*备注: Published in ICML 2026
Abstract:Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution. Our code is publicly available at this https URL.
[LG-96] A Per-Component Diagnostic Protocol for Neural HJB-PIDE Solvers under Control-Dependent Lévy Jumps
链接: https://arxiv.org/abs/2606.01122
作者: R. Drissi
类目: Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:
Abstract:We propose a five-step diagnostic protocol for residual-trained neural HJB-PIDE solvers with control-dependent Lévy jumps, targeting a general failure mode of neural PDE methods: a learned solution can match headline scalar diagnostics while miscomputing an operator inside its training loss. The protocol pairs each neural solve with at least one from-scratch independent reference, decomposes the Hamiltonian into drift, diffusion, compensator, and nonlocal-integral components across a u-grid, and compares the value function and its low-order derivatives over a (t,x) grid before any argmax comparison. Applied to a standard CRRA-Merton-Variance-Gamma benchmark, it isolates a missing 1/2-mixture factor in the neural method’s importance-proposal density that scaled the nonlocal integral by exactly half - a textbook signature of a constant proposal scale error, invisible to longer training, grid refinement, and truncation sweeps. With the bug corrected, four references - two finite-difference solvers with disjoint discretizations, the neural solver, and a semi-analytic scalar baseline obtained from CRRA homogeneity - agree on the optimal control to within ~2%. The constant-coefficient CRRA benchmark collapses by homogeneity to a scalar maximization, so the scalar baseline is the efficient method here; the contribution is the protocol, applicable in principle to non-homogeneous and higher-dimensional settings where neural HJB-PIDE solvers are genuinely needed. The episode is a concrete instance of a broader neural-PDE verification failure: pointwise agreement of a learned value or control can coexist with a systematically wrong nonlocal operator, so per-component and surface-level checks are needed before trusting the argmax policy.
[LG-97] LeAP: Learnable Adaptive Permutation for Feature Selection in Heterogeneous and Sparse Recommender Systems
链接: https://arxiv.org/abs/2606.01111
作者: Yihong Huang,Chen Chu,Fei Chen,Yu Lin,Ruiduan Li,Zhihao Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern industrial recommender systems rely on thousands of heterogeneous features – ranging from low-dimensional scalars (e.g., statistical value) to high-dimensional embeddings (e.g., user-id embeddings, MLP representations) – to achieve high-precision predictions. Given the immense computational costs associated with training, efficient feature selection is critical. However, existing methods encounter three primary bottlenecks: (1) they typically assume uniform feature dimensions or require costly mapping to a fixed size; (2) they struggle with extreme sparsity, where the majority of features (e.g., 99%+) remain at default values; and (3) traditional permutation-based approaches are computationally prohibitive in large-scale settings. To address these challenges, we propose LeAP (Learnable Adaptive Permutation), a novel, model-agnostic plug-in module for feature selection. LeAP transforms the inefficient random permutation process into a learnable mechanism, significantly accelerating the evaluation of feature importance. In addition, we introduce an adaptive regularization strategy tailored for heterogeneous dimensions and extreme sparsity, enabling superior feature importance ranking results across asymmetric input spaces. Experiments on four public recommendation datasets demonstrate that LeAP achieves state-of-the-art performance. Furthermore, LeAP has been deployed in a large-scale industrial search ranking model with over a billion daily requests and a 2TB model parameter scale. In this real-world scenario involving 12,000+ total feature dimensions, LeAP successfully identified and removed over 3,600 redundant dimensions without performance degradation, which is 2 to 10 times the ability of compared baseline methods. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.01111 [cs.LG] (or arXiv:2606.01111v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.01111 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-98] How (and when) can you fit examples to logic-based hypothesis classes over infinite structures?
链接: https://arxiv.org/abs/2606.01107
作者: Michael Benedikt,Alessio Mansutti
类目: Logic in Computer Science (cs.LO); Machine Learning (cs.LG); Logic (math.LO)
*备注:
Abstract:We study fitting problems, sometimes called ``training problems’', where we have a finite sample consisting of inputs and outputs, and we want to know whether there is a function in a certain class that could produce these outputs, exactly or approximately, on the given inputs. We focus on the computational and descriptive complexity of fitting for logically-defined classes in common decidable structures, like the real ordered field and Presburger arithmetic, and also for broader classes defined via combinatorial or model-theoretic properties. We isolate the complexity of these fitting problems, with particular attention to cases where we can use queries in a natural query language over the sample to determine whether a sample is fittable.
[LG-99] Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback
链接: https://arxiv.org/abs/2606.01081
作者: Wyame Benslimane,Tinghan Ye,Pascal Van Hentenryck,Paul Grigas
类目: Machine Learning (cs.LG)
*备注:
Abstract:Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on-policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting. Our method learns a stochastic predict-then-optimize policy that samples a cost-vector prediction from a conditional distribution and solves the resulting downstream linear optimization problem. To update this distributional model, we introduce a two-component hybrid gradient estimator. The first component is a score function estimator, which provides an unbiased but potentially high-variance policy gradient estimate. The second is a decision-focused plug-in component that uses an auxiliary nuisance estimate of the latent cost vector to exploit the downstream optimization structure, becoming more informative as the estimate improves. We prove an \mathcalO(T^-1/2) bound on the average squared policy-gradient norm, matching the standard non-convex SGD rate. Experiments on top- k selection, shortest path, combinatorial pricing, and a real-data energy-scheduling benchmark show that the hybrid gradient approach achieves lower cumulative regret than contextual-bandit-style baselines across all benchmarks, using both Gaussian and richer conditional generative models. Code is available at this https URL.
[LG-100] Non-Vacuous Certification of Transport MCMC via Oscillation-Controlled Normalizing Flows
链接: https://arxiv.org/abs/2606.01078
作者: Jun Hu
类目: Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注: 36 pages, includes appendix
Abstract:Transport MCMC trains a normalizing flow to precondition Metropolis–Hastings proposals, achieving high empirical efficiency on challenging posteriors; yet no prior work produces a numerically non-vacuous, rigorous spectral-gap bound for such samplers. We establish the first such bounds. For independence MH on the banana family we certify (\gamma^\ast = 0.828) at (D = 2) (covering in the original space) and (\gamma^\ast \ge 7.6\times 10^-4) at (D = 5) (covering in an analytically unwarped Gaussian space with a grid-certified gradient bound under the stated numerical Lipschitz certification), both rigorous at 95% confidence. The framework rests on three pillars: (i) spectral normalization with reduced scale clips constrains the flow Lipschitz constant from (10^47) to (10^4); (ii) a coverage-based empirical oscillation bound replaces the vacuous analytical bound with a data-dependent certificate; and (iii) oscillation-regularised training cuts the empirical oscillation by 60–90% at no cost to density fit, extending practical certificates through (D = 20) ((\gamma^\ast \ge 1.7\times 10^-4)). Tests on four further targets (Gaussian mixture, shear-building, Neal’s funnel, Bayesian logistic regression) identify three precise barriers: boundary curvature, target stiffness, and tail-coverage mismatch. An affine-vs-spline comparison shows that simpler architectures yield tighter certificates at identical NLL, inverting the usual expressiveness hierarchy.
[LG-101] Interaction-Limited Safe Continuous-Time RL for Dynamical Medical Treatment
链接: https://arxiv.org/abs/2606.01051
作者: Xun Shen,Yuepeng Wang,Akifumi Wachi,Yongqi Zhou,Richard Weiss,Yoshihiko Fujisawa,Ken Kawano,Mehrshad Sadria,Ying Chen,Xin Liu,Sebastien Gros,Xiao Hu,Kyoung-Sook Kim,Mengmou Li,Katsuki Fujisawa,Kenji Wakabayashi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Dynamic medical treatment requires deciding treatment intensity and intervention timing, while patient states evolve continuously and adverse events may occur between clinical interactions. Most existing treatment learning methods assume fixed schedules or enforce safety only at discrete decision points. We propose Interaction-Limited Safe Continuous-Time Reinforcement Learning, a framework that jointly optimizes treatment administration and clinical interaction timing under trajectory-level safety constraints. Our key idea is to reformulate the continuous time treatment problem as an option-based semi-Markov decision process, where each option specifies a continuous-time treatment policy and its duration. We develop a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full continuous-time trajectory with high probability. We further establish finite-sample guarantees for policy learning from logged treatment trajectories and introduce a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant interaction schemes across different safe policy optimization methods.
[LG-102] MedGym:A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning
链接: https://arxiv.org/abs/2606.01028
作者: Yuepeng Wang,Ken Kawano,Yongqi Zhou,Yoshihiko Fujisawa,Richard Weiss,Akifumi Wachi,Katsuki Fujisawa,Ying Chen,Mehrshad Sadria,Xin Liu,Kyoung-Sook Kim,Xiao Hu,Sebastien Gros,Xun Shen
类目: Machine Learning (cs.LG)
*备注:
Abstract:Medical treatment recommendation poses several challenges to reinforcement learning (RL): patient physiology evolves in continuous time, measurements and interventions are performed at irregular intervals, and treatment effects vary substantially across individuals. Existing RL formulations and simulated environments, however, are based on discrete-time MDP or POMDP abstractions with fixed or pre-specified decision intervals. Thus, it remains difficult to evaluate whether RL methods can handle time-interval-dependent disease progression, personalized treatment response, and safety between consecutive measurement points. To address this gap, we introduce MedGym, a benchmark environment for dynamic treatment recommendation. MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. Besides, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment. By providing a standardized and configurable benchmark for continuous-time dynamic treatment, MedGym aims to facilitate more realistic and informative evaluation of medical RL methods.
[LG-103] Data Enrichment for Symbolic Regression Using Diffusion Models
链接: https://arxiv.org/abs/2606.00988
作者: Simon De Reuver,Tamas Kristof Toth,Teddy Lazebnik
类目: Machine Learning (cs.LG)
*备注:
Abstract:Symbolic regression (SR) offers a route to scientific discovery by converting observations into interpretable governing equations. However, despite its promise, its reliability degrades sharply when spatiotemporal measurements are sparse, noisy, or physically incomplete, as commonly occurring in practice. Data enrichment (DE) has been shown to be able to mitigate this limitation, yet additional samples can mislead equation discovery unless they preserve the physical structure of the target system. Such implication of DE requires narrow domain expertise as well as technical fluidity, highly limiting its practical usefulness. In this study, we introduce a physics-guided latent diffusion framework for DE for down the line SR models. The proposed framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to complete sparse observations with synthetic fields constrained by governing relations. We evaluate the approach on heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential, using GPLearn, DEAP, and PySR as downstream SR backends. Our results reveal that physics-corrected enrichment consistently improves recovery in sparse regimes across physical dynamics and SR models. These results show that generative enrichment can strengthen equation discovery without additional domain expertise.
[LG-104] Profiling Privacy Preservation Against Gradient Inversion Attacks in Tabular Federated Learning
链接: https://arxiv.org/abs/2606.00986
作者: Ivo Osterberg Nilsson,Maximilian Birr Engvall,Viktor Valadi,Teddy Lazebnik
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated learning (FL) enables multiple data holders to train machine learning models collaboratively without centralizing raw data, making it useful in privacy sensitive domains such as healthcare and institutional data sharing. FL keeps data local to clients while communicating only model updates, such as gradients or model deltas. Nevertheless, these updates can expose private client data through gradient inversion attacks (GIAs). We study this risk for tabular FL under an honest-but-curious server threat model across FL protocols, client batch sizes, training stages, attacker assumptions, model architectures, and binary classification, multiclass classification, and regression tasks. We use MIMIC-IV and complementary benchmark datasets. Our evaluation distinguishes numerical and categorical recovery, baseline recoverability, feature level recovery, and exact match rate (EMR). We evaluate FedSGD gradients and FedAvg model deltas with an exposure aligned protocol, comparing attacked models after matched client data exposure rather than matched communication rounds. We compare multilayer perceptron (MLP), ResNet, and FT-Transformer models, and isolate architecture effects through an MLP grid over width, depth, activation, normalization, and dropout. The results show that small client batches and updates representing few distinct records are most vulnerable. Larger local batches and stronger aggregation reduce reconstruction but do not eliminate leakage. FT-Transformer is consistently harder to invert than one-hot baselines, while reconstructability also varies substantially within the MLP family. These findings identify architecture as a practical privacy variable in tabular FL. We also show that aggregate reconstruction accuracy can overstate complete record recovery in sparse data, making EMR and baseline comparisons essential.
[LG-105] UME: A Unified Meta-Generalization Framework for Cross-Domain ETA
链接: https://arxiv.org/abs/2606.00979
作者: Duo Wang,Qiong Wu,Jianguo Wu,Ruiyu Xu,Jinhui Yi,Zhonggen Sun,Zhentao Zhang,Yu Zhang,Ke Xing,Yongjun Yin,Zishuo Li,Jianwen Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Accurate Estimated Time of Arrival (ETA) prediction on checkout page is crucial in instant logistics for enhancing user satisfaction, optimizing dispatching, and controlling operational costs. In international on-demand delivery platforms, where ETA data originates from diverse countries or regions with different patterns, multi-domain modeling is of great importance and has been widely adopted. However, existing methods still face three critical challenges in real-world deployment. First, current multi-domain models struggle to generalize to completely unseen domains, failing to achieve zero-shot prediction during the initial cold-start phase. Second, cross-domain feature spaces are often assumed to be consistent, whereas new domains commonly suffer from structural missingness of offline (statistical) features due to the lack of historical data. Third, such feature missingness often compels industrial systems to model mature and cold-start domains separately, hindering knowledge transfer and increasing maintenance overhead. To address these challenges, we propose \textbfUME, a \textbfUnified \textbfMeta-generalization framework for \textbfETA. Specifically, UME integrates a unified dual-branch architecture with a novel meta-learning mechanism that employs a hypernetwork-based meta learner. By leveraging domain-level knowledge and instance-level context, the meta learner empowers three meta modules to dynamically modulate feature gating, expert attention, and final prediction, capturing cross-domain correlations and facilitating intra-domain adaptation. A knowledge distillation strategy is further introduce to enhance performance. UME has now been deployed in Meituan-keeta delivery platform (the largest international food delivery platform in China). Extensive offline experiments and online A/B tests demonstrate that UME significantly outperforms existing baselines.
[LG-106] Optimal-Point Variance Reduction For Bayesian Optimization With Regret Guarantee
链接: https://arxiv.org/abs/2606.00956
作者: Shion Takeno
类目: Machine Learning (cs.LG)
*备注: 23pages, 3 figures
Abstract:This paper studies a one-step lookahead Bayesian optimization (BO) method and its theoretical guarantee. Although the empirical effectiveness of one-step lookahead BO methods, such as entropy search, has been studied extensively, they often rely on computationally intractable approximations, and their regret guarantees remain underdeveloped. Thus, this paper proposes a one-step lookahead BO method called optimal-point variance reduction (OVR), which requires only posterior sampling and Monte Carlo approximations. We obtain a uniform error bound over an input domain for the Monte Carlo estimation in OVR. Furthermore, we show that the regularized OVR, with the slight modification to promote exploration, achieves a vanishing Bayesian expected simple regret upper bound. Finally, we demonstrate the effectiveness of OVR through numerical experiments.
[LG-107] CryoProt: A Protein Pretraining Framework with Cross-Box Interactions on Cryo-EM Density Maps
链接: https://arxiv.org/abs/2606.00955
作者: Dan Luo,Xuan Lin,Peng Zhou,Junwen Zhu,Tengfei Ma,Xiangxiang Zeng,Yiping Liu
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注:
Abstract:Despite the growing availability of cryo-electron microscopy (cryo-EM) density maps, effectively leveraging them for protein representation remains challenging. First, current methods lack a general-purpose protein pretraining framework tailored for cryo-EM density maps, designed for protein-related property prediction. Second, existing approaches typically partition density maps into local box regions and model them independently, overlooking interactions across boxes which are essential for capturing global structural context in cryo-EM density map. To address these challenges, we propose CryoProt, a protein pretraining framework designed for cryo-EM density maps. CryoProt introduces a Map Encoder based on multi-head latent attention (MLA), where box-level representations interact through a shared latent space, enabling explicit modeling of cross-box dependencies within the density map. Furthermore, we adopt a multi-task pretraining strategy to learn generalizable representations that can be effectively transferred to diverse downstream tasks, such as protein flexibility prediction, where cryo-EM density maps are not required and can be inferred implicitly by the pretrained model. Experimental results demonstrate that CryoProt consistently outperforms existing state-of-the-art methods across multiple benchmarks, achieving up to 12% improvement over the best-performing baselines, highlighting the importance of modeling cross-box interactions in cryo-EM data. The source code is publicly available at this https URL.
[LG-108] COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space ICML2026
链接: https://arxiv.org/abs/2606.00950
作者: Yao Luan,Ni Mu,Hanfei Ge,Yiqin Yang,Bo Xu,Qing-Shan Jia
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:Unsupervised skill discovery (USD) aims to learn diverse behaviors without reward functions, but often results in task-irrelevant or hazardous behaviors due to uniform exploration. Guided skill discovery (GSD) addresses this issue by incorporating human intent to focus exploration on meaningful regions. However, existing GSD methods typically require training additional guidance models, and rely on pre-defined rules or expert demonstration, which can be ineffective under sparse, online-collected human feedback. To overcome this, we propose COLLIE, a GSD framework that leverages dense unsupervised data to construct a semantically coherent skill latent space. This latent space is well-structured, enabling reliable guidance with sparse online feedback. Moreover, its semantic coherence property enables training-free construction of guidance signals, eliminating the need for additional model training beyond skill learning. Theoretical analysis justifies the effectiveness of our training-free guidance signal, while experiments across diverse state-based and pixel-based tasks show that COLLIE learns diverse, human-aligned skills, avoids hazardous behaviors, and achieves superior downstream performance with minimal human feedback.
[LG-109] PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA ICML2026
链接: https://arxiv.org/abs/2606.00944
作者: Shihao Wang,Xueru Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026) as an oral presentation
Abstract:Applying differential privacy (DP) via DP-SGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, LoRA’s low-rank parameterization poses a fundamental challenge. In LoRA, each trainable update is represented as a low-rank matrix Z = AB^\top , but this factorization is inherently non-identifiable: many factor pairs (A,B) represent the same update Z . As a result, applying DP-SGD directly to the factors induces gauge-dependent perturbations on Z , and we show that this naive DP-LoRA can lead to unbounded noise amplification. We propose PRISM, an intrinsic DP mechanism for LoRA that is gauge invariant by construction, avoids bilinear noise amplification, and admits an efficient low-dimensional noise sampler. Moreover, PRISM yields a closed-form characterization of the effective intrinsic noise induced on Z , enabling stable privacy-utility trade-offs through bounded, gauge-invariant perturbations. We establish standard (\epsilon,\delta) -DP guarantees for PRISM and introduce a DP-aware, gauge-invariant adaptive update rule that prevents adaptive optimization from amplifying injected privacy noise, improving numerical stability in practice.
[LG-110] Machine Learning Surrogate Modeling for Homogenization of Hyperelastic Materials with Boolean Microstructures
链接: https://arxiv.org/abs/2606.00938
作者: Matthias Brändel,Oliver Rheinbach
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 16 pages, 7 figures
Abstract:Data-driven surrogate models are an alternative to numerical homogenization of heterogeneous materials. In this contribution, a supervised learning approach is presented for predicting effective Lamé parameters of hyperelastic composites from low-dimensional microstructural descriptors. The data set is based on previously published numerical homogenization results for ensembles of two-phase stochastic microstructures generated by planar Boolean models, covering variations of inclusion shape, phase contrast, and area fraction; see Brändel, Brands, Maike, Rheinbach, Schröder, Schwarz and Stoyan (2022). A neural network is trained on combinations of scalar and curve-valued statistical descriptors, including the area fraction, a derived scalar shape descriptor \tau , the two-point correlation function S_2® , and the lineal-path function \ell(z) . Additional data representing limiting cases of the parameter space are incorporated to stabilize training and improve extrapolation behavior. The surrogate is evaluated by leave-one-grain-type-out cross-validation in order to assess generalization to unseen grain geometries. Numerical results demonstrate that additional descriptors can reduce relative errors. A predictor trained with \tau and S_2® provides a compact representation with good quantitative accuracy and regular dense response behavior. Adding the lineal-path function \ell(z) further reduces the error at the available data points, indicating that it is a promising additional descriptor; however, dense post-training response evaluations show that improved pointwise accuracy does not automatically guarantee physically admissible behavior between sampled parameter values. This motivates future work on physically constrained surrogate models, loss formulations, bounded output parametrizations, and a more systematic representation of curve-valued geometric descriptors.
[LG-111] Cellular Sheaf Neural Operators for Structure-Preserving Surrogate Modeling of Constrained PDEs
链接: https://arxiv.org/abs/2606.00937
作者: Lennon J. Shikhman,Shane Gilbertie
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Numerical Analysis (math.NA); Computational Physics (physics.comp-ph); Plasma Physics (physics.plasm-ph)
*备注: 41 pages, 5 figures, 3 tables
Abstract:Neural operators provide fast surrogate models for PDE simulations, but standard architectures often treat geometry and discretization as secondary to field data. Physical states are usually represented as grid-channel stacks, even when different quantities naturally belong on vertices, edges, faces, cells, boundaries, or interfaces and must satisfy compatibility constraints. We propose Cellular Sheaf Neural Operators, a discretization-aware framework for structure-preserving neural PDE surrogates. The method represents PDE states on oriented cell complexes, couples local feature spaces through learned restriction maps, and uses incidence/Hodge-informed message passing to follow computational geometry. Learned update heads pass through coboundary or flux maps, allowing selected constraints to arise from cell-complex structure rather than only from loss penalties. For magnetohydrodynamics, this yields face-based magnetic-flux updates driven by edge electromotive fields and finite-volume-style fluid updates driven by learned face fluxes and cell sources. On turbulent MHD and fusion-equilibrium surrogate tasks, the method improves structure-sensitive diagnostics, including rollout behavior, divergence control, spectral error, and equilibrium-regression accuracy. These results indicate that cellular-sheaf structure is a useful inductive bias for neural PDE surrogates in constrained multiphysics systems.
[LG-112] An Exploratory Study into using Machine-Learning for Fast Step-by-step Emulation of Numerical Mechanical Thrombectomy Simulations for Ischemic Stroke
链接: https://arxiv.org/abs/2606.00892
作者: Thijs Stessen(University of Amsterdam)
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph)
*备注: 40 pages, 16 figures, master thesis artificial intelligence
Abstract:The treatment of ischemic stroke using mechanical thrombectomy involves difficult decisions under intense time constraints. Numerical physics simulations can in theory inform operators to make better decisions regarding treatment approaches and device selection, but are too slow to do so in practice. In this thesis, we investigate if current machine learning based surrogates can accurately emulate these simulations in a step-by-step manner while making them significantly faster. To do this we train three surrogate models on two simulations that involve a simplified aspiration procedure, with varying levels of geometric complexity. Our results show that two of our models accurately predict singular simulation steps and provide substantial speedups, especially when combined with specific data augmentations. However, the models showed a lack of stability when emulating simulations with complex geometries over longer time periods. Overall, this work provides a foundation for future studies to develop stable methods that scale to realistic numerical physics simulations of mechanical thrombectomy.
[LG-113] A Lightweight Hybrid MLP-Based Framework for Real-Time Phishing URL Detection Using Structural URL Features
链接: https://arxiv.org/abs/2606.00889
作者: Uche Unoke Emmanuel,Gideon Francis Oghie
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 27 pages, 6 figures, 12 tables
Abstract:Phishing attacks remain a major cybersecurity threat, exploiting deceptive URLs to steal sensitive user information. Traditional blacklist and rule-based detection approaches are reactive and often fail to identify newly emerging phishing URLs. This paper proposes a lightweight hybrid framework for real-time phishing URL detection that combines blacklist-based screening with a Multi-Layer Perceptron (MLP) classifier operating solely on structural URL features. The framework extracts 16 URL-derived features capturing structural, domain-based, and security-related characteristics without requiring webpage content access, third-party APIs, or visual rendering, making it computationally efficient for real-time deployment. The system was trained and evaluated on the PhiUSIIL phishing dataset containing 235,795 labelled URLs. Experimental results show that the proposed MLP achieved 99.24% accuracy, 98.74% precision, 99.95% recall, 99.34% F1-score, and 99.65% ROC-AUC, outperforming Random Forest, Logistic Regression, XGBoost, LightGBM, and CatBoost under the same evaluation setting. The hybrid architecture achieved an average inference latency of 1.2 ms per URL and a peak throughput of 4,200 URLs per second under concurrent processing. A functional desktop application prototype, CyberGuard, further demonstrates deployment viability. The results indicate that the proposed framework provides an accurate and computationally efficient solution for real-time phishing URL detection in resource-constrained environments.
[LG-114] Enhancing LLM Metacognition via Cognitive Pairwise Training
链接: https://arxiv.org/abs/2606.00869
作者: Weitao Li,Hao Zhou,Xuanyu Lei,Fandong Meng,Yuanhang Liu,Jingyi Ren,Ante Wang,Xiaolong Wang,Yuanchi Zhang,Fuwen Luo,Guangwen Yang,Lin Gan,Weizhi Ma,Yang Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning–metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at this https URL.
[LG-115] Meta-Black-Box Optimization with Ensemble Surrogate Modeling for Robustness-Accuracy Trade-off within SAEA
链接: https://arxiv.org/abs/2606.00862
作者: Xiao Jin,Yongxiong Wang,Haobo Liu,Yudong Du,Yukun Du
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Surrogate-assisted evolutionary algorithms (SAEAs) have been widely used for expensive black-box optimization problems. However, their reliance on rigid and manually designed components limits their flexibility and generalization across tasks. Meta-black-box optimization (MetaBBO) provides a promising paradigm for adaptively configuring algorithmic components. Nevertheless, existing MetaBBO methods usually control only a single component, and few studies have investigated the unified control of multi-component optimizers such as SAEAs. Moreover, the robustness-accuracy trade-off in surrogate modeling, which is crucial for stable early-stage exploration and accurate late-stage exploitation, has rarely been explicitly considered. To address these issues, we propose AdaE-SAEA, an adaptive ensemble surrogate-assisted evolutionary algorithm for expensive multi-objective optimization. AdaE-SAEA embeds SAEA as the low-level optimizer within the MetaBBO framework and jointly controls the infill criterion and ensemble-based surrogate modeling. Specifically, bagging and boosting are designed as surrogate modeling modules to adaptively balance robustness and accuracy across different search phases, while the meta-policy simultaneously selects the infill criterion to enable adaptive sampling decisions. The meta-policy is trained through reinforcement learning with parallel sampling and centralized training, improving both training efficiency and transferability. Experiments on synthetic and real-world problems demonstrate that AdaE-SAEA outperforms state-of-the-art baselines and MetaBBO-based methods. We further verify the effectiveness of TabPFN as the base surrogate model for ensemble learning. To the best of our knowledge, this is the first work to unify the control of surrogate modeling and infill criteria in SAEAs while explicitly addressing the robustness–accuracy trade-off.
[LG-116] CUPID in the Model Zoo: Online Matchmaking for Selecting Your Dream LLM
链接: https://arxiv.org/abs/2606.00846
作者: Son Nguyen,Xinyuan Liu,Ransalu Senanayake
类目: Machine Learning (cs.LG)
*备注: 38 pages, 11 figures
Abstract:Users increasingly face the challenge of selecting an appropriate LLM for a given task from a rapidly growing pool of LLMs, each with distinct but often opaque latent properties. Compounding this challenge, users may lack the vocabulary or awareness to explicitly articulate the characteristics they value in an LLM’s responses or deployment. We propose an interaction-efficient active learning framework in which a dueling bandit algorithm iteratively selects pairs of LLMs, collects user feedback about their responses, and updates its belief about the user’s latent preferences. We introduce a novel belief-aware upper confidence bound strategy that balances exploration of the model pool with exploitation of inferred preferences, enabling efficient alignment between user needs and LLM capabilities under user-specified cost and time budgets. Through diverse experiments on LLMs and human studies, we experimentally verify that our model can efficiently match well-aligned LLMs to users at a lower cost.
[LG-117] Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning
链接: https://arxiv.org/abs/2606.00837
作者: Byoungwoo Park,Utkarsh A. Mishra,Jaemoo Choi,Juho Lee,Yongxin Chen
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Project page: this https URL
Abstract:Diffusion models provide strong priors for generating structured data, but many tasks require outputs beyond the scale on which these models are typically trained. Compositional generation addresses this by composing overlapping local plans from a pretrained short-horizon prior into a long-horizon output. However, standard composition primarily enforces agreement between neighboring local plans, yielding local consistency without directly specifying the global structure of the full composition. As a result, locally compatible plans may still form an implausible route, task sequence, or temporal evolution. Existing methods improve global coherence by repeatedly propagating local consistency signals or by adding inference-time optimization, but these procedures become expensive as the number or dimensionality of local plans increases. We propose Coarse-to-Fine Compositional Diffusion (CoFi), an inference-time sampler that separates global structure formation from local detail refinement. CoFi first aligns local denoised estimates around a shared coarse structure, producing a global scaffold that captures the long-range task-level arrangement. It then diffuses this scaffold to an intermediate noise level and denoises it with the same pretrained local prior, restoring local fine structure while preserving the scaffold-induced global coherence. Across long-horizon robotic planning, panoramic image generation, and long video generation, CoFi not only improves both global coherence and local sample quality over prior compositional baselines, but also requires 2-8x fewer denoiser evaluations.
[LG-118] Online Packet Scheduling with Deadlines and Learning
链接: https://arxiv.org/abs/2606.00835
作者: Gianmarco Genalti,Achraf Azize,Vianney Perchet
类目: Machine Learning (cs.LG)
*备注:
Abstract:Network routers that enforce Quality-of-Service (QoS) guarantees must decide, at every clock cycle, which expiring packet of information to transmit, even when the value of the packet is unknown until it is processed. We frame this problem as the Online Packet Scheduling with Deadlines (OPSD) problem under Partial Feedback: packets arrive at every clock cycle, with different deadlines, but the weights are only observed after execution. Under a stochastic assumption on the unknown weights, we explore different variants of the OPSD problem with bandit feedback. We establish a connection between our setting and the sleeping bandits problem, and set our learning goal to \alpha -regret minimization. We provide algorithms with provable \alpha -regret guarantees under different spans of slackness, distinguishing systems allowing for randomization and systems that do not. In every scenario, our algorithms achieve an \alpha -regret upper bound of \widetilde\mathcalO\left(\sqrtKT\right) , matching the lower bound for the standard bandit setting. In the practically relevant case of 2 -bounded deadline instances, where the deadline is set at most one clock cycle away from the arrival, our deterministic algorithm achieves the provably tightest possible competitive ratio. Remarkably, when the number of distinct packet types K\ge 2 is finite, it is possible to break the well-established \Phi = \frac1+\sqrt52 competitive ratio barrier and attain a tighter competitive ratio \theta_K ranging in [\sqrt2, \Phi) .
[LG-119] Partial Fairness Awareness: Belief-Guided Strategic Mechanism for Strategic Agents AAAI2026
链接: https://arxiv.org/abs/2606.00826
作者: Xinpeng Lv,Chunyuan Zheng,Yunxin Mao,Renzhe Xu,Hao Zou,Shanzhi Gu,Liyang Xu,Huan Chen,Yuanlong Chen,Wenjing Yang,Haotian Wang
类目: Machine Learning (cs.LG)
*备注: Accepted by AAAI2026
Abstract:Strategic machine learning investigates scenarios where agents manipulate their features to receive favorable decisions from predictive models. To address fairness concerns intrinsic to strategic classification, recent work has introduced group-specific fairness constraints. However, current fairness-aware approaches face a fundamental dilemma in the issue of fairness exposure: making these constraints public enables strategic manipulation and can lead to fairness reversal, while keeping them hidden may reduce social welfare and discourage genuine improvement. To fill this gap, we subsequently propose the problem of partial fairness awareness (PFA), as our theoretical analysis informs that such a dilemma can be mitigated by releasing the candidate set of fairness constraints and concealing the grounding constraint. To be specific, we introduce a belief-guided strategic mechanism, wherein agents iteratively interact with the decision system and maintain a belief distribution over the candidate set of fairness constraints. This belief-guided process enables agents, through iterative interaction and feedback, to update their belief distribution over the candidate set, thereby gradually aligning their belief with the grounding fairness constraint employed by the system. Extensive experiments on real-world and synthetic datasets demonstrate that PFA achieves lower group fairness gaps, higher acceptance of truly qualified individuals, and more stable outcomes compared to fully public or private fairness regimes.
[LG-120] A Comparative Analysis of Machine Learning Algorithms for Multi-Task Prediction of the Parameters of the Pectin Hydrolysis–Extraction Process
链接: https://arxiv.org/abs/2606.00821
作者: Mullosharaf K. Arabov,Shavkat Yo. Kholov,Zainiddin K. Muhiddin
类目: Machine Learning (cs.LG)
*备注: Preprint
Abstract:This study addresses the challenge of controlling a complex, multi-parameter technological process – pectin hydrolysis–extraction – using machine learning methods. The experimental foundation is a unique database comprising 1,000 laboratory experiments conducted under controlled conditions on seven types of plant raw material with four variable process factors (temperature 85–130 C, pressure 0.9–2.2 atm, holding time 3–10 min, pH 1.5–2.0). Four output characteristics were recorded: pectin yield, galacturonic acid content, molecular weight, and degree of esterification. To solve the multi-task regression problem, 11 algorithms were trained and compared: regularised linear models, ensemble methods (Random Forest, Gradient Boosting, XGBoost, CatBoost, Extra Trees), k-nearest neighbours, support vector regression, and a multilayer perceptron. The best results were demonstrated by CatBoost (average R-squared approximately 0.946 after hyperparameter optimisation). Feature importance analysis revealed the dominant role of the raw material type (63.6% of total importance), followed by temperature and holding time. The developed pipeline was exported in a production-ready format and deployed as an interactive web interface. The findings demonstrate that ensemble methods combined with rigorous statistical analysis and interpretable AI significantly reduce the need for physical experiments and form the basis for intelligent pectin production control.
[LG-121] OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models
链接: https://arxiv.org/abs/2606.00815
作者: Ziling Lu,Zongsheng Li,Xinke Shen,Kexin Lou,Yingyue Xin,Xiaoqi Chen,Shinan Wang,Xiang Chen,Jiahao Fan,Chenyu Huang,Xin Xu,Zhoujie Hou,Chen Wei,Quanying Liu
类目: Machine Learning (cs.LG)
*备注: 28 pages, 13 figures, 8 tables; benchmark of EEG foundation models
Abstract:Electroencephalography (EEG) supports a variety of brain-computer interface (BCI) tasks ranging from brain-state monitoring to human-LLM interactions. EEG foundation models are emerging, but evaluation remains fragmented due to heterogeneous datasets and nconsistent task protocols. Here, we introduce OmniEEG-Bench, a unified benchmark and downstream task roadmap for EEG foundation models (FMs). It organizes evaluation of EEG FMs into six task families spanning (i) signal reliability, (ii) biometrics and disease, (iii) consciousness and state, (iv) cognition and emotion, (v) naturalistic stimulus decoding, and (vi) motor and interaction, introducing a new generation of tasks not systematically benchmarked in prior EEG FM work. OmniEEG-Bench standardizes model deployment, task definitions, and metrics through a task-card specification, and unifies 54 EEG datasets with consistent evaluation protocols. We benchmark 10 representative EEG foundation models and report a leaderboard that covers diverse evaluation settings. Both pretraining dataset diversity and model size are significantly associated with better average ranks across datasets, revealing scaling-law behavior in EEG foundation models (Figure 1). These results suggest that scaling EEG foundation models requires not only larger architectures but also broader and more diverse pretraining data. The benchmark code is available at this https URL.
[LG-122] Safe-Subspace Pseudo-Label Refinement for Source-Free Graph Domain Adaptation
链接: https://arxiv.org/abs/2606.00808
作者: Yingxu Wang,Xinwang Liu,Siyang Gao,Nan Yin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Source-free graph domain adaptation (SF-GDA) aims to adapt source-trained graph models to unlabeled target graphs when source graphs are no longer accessible. A central obstacle is pseudo-label reliability: under feature and topological shifts, source-induced predictions may become confidently wrong, and indiscriminate self-training can amplify systematic errors through graph message passing. This paper studies SF-GDA from a selective pseudo-labeling perspective. Instead of assuming globally bounded pseudo-label noise over the entire target domain, we identify a confidence-consistent safe subspace on which pseudo-label noise can be controlled under restricted posterior discrepancy, and derive a target-risk decomposition that separates safe-subspace fitting error, selected-label noise, and uncertain-set risk. Guided by this analysis, we propose SafeSubspace Pseudo-Label Refinement (S ^2 PLR), a source-free graph adaptation framework that applies hard pseudo-label supervision only to target graphs supported by both semantic and structural evidence. Specifically, S ^2 PLR estimates semantic reliability using source-committee confidence and disagreement, learns a targetintrinsic structural representation via graph contrastive learning, verifies pseudo-labels through neighborhood consistency, and exploits the remaining uncertain samples with noise-tolerant soft regularization rather than unreliable hard labels. Experiments on image and real-world graph benchmarks under different domain shifts demonstrate that S ^2 PLR achieves robust and competitive performance across diverse source-free transfer settings.
[LG-123] Latent Diffusion Pretraining for Crystal Property Prediction ICML2026
链接: https://arxiv.org/abs/2606.00776
作者: Shrimon Mukherjee,Kishalay Das,Partha Basuchowdhuri,Pawan Goyal,Niloy Ganguly
类目: Machine Learning (cs.LG)
*备注: Published in ICML 2026
Abstract:Fast and accurate prediction of crystal properties is a central challenge in new materials design. Graph neural networks and Transformer-based models have emerged as powerful tools for this task due to their ability to encode the local structural environment of atoms within a crystal. However, these models are data-hungry, and in practice, labeled data for crystal properties are scarce. Pretraining-finetuning strategies, particularly those based on diffusion models, have shown promise in addressing these limitations. In this work, we introduce a novel latent diffusion based pretraining framework, CrysLDNet, designed to mitigate data scarcity. Our approach integrates a Variational Autoencoder (VAE) with a diffusion model during the pretraining stage. The VAE encoder maps 3D crystal structures into a smooth latent space within which the diffusion process is applied. This latent diffusion pretraining enables the graph encoder to effectively capture structural and chemical semantics from large-scale unlabeled data, which can then be finetuned for specific property prediction tasks. Comprehensive experiments on popular DFT datasets for property prediction reveal that CrysLDNet significantly outperforms both training-from-scratch and pretrained baselines, with improvements of 4.26% and 4.90% on the JARVIS and MP datasets, respectively. Additionally, the learned representations remain robust in sparse-data conditions and are expressive enough to correct DFT errors when finetuned with limited experimental data. Code is available at: this https URL.
[LG-124] Distributed GNEP Algorithms without Multiplier Sharing and Applications to Multi-Robot Coordination and Contextual Bandit-Based Active Learning
链接: https://arxiv.org/abs/2606.00759
作者: Shao-An Yin
类目: Machine Learning (cs.LG)
*备注: 136 pages, 14 figures
Abstract:Recent advances in artificial intelligence have expanded the focus from classical optimization to include equilibrium analysis in noncooperative games. Many such games involve shared constraints, leading to Generalized Nash Equilibrium Problems (GNEPs). Existing distributed algorithms typically require agents to exchange Lagrange multipliers to enforce consensus and compute variational-GNEs (v-GNEs). This work introduces fully distributed continuous-time algorithms and establishes convergence without requiring multiplier exchange, thereby reducing information exchange per iteration while improving privacy preservation. The analysis focuses on strongly monotone games with convex individual constraints and linear shared constraints. I also propose several discretization schemes for the continuous-time algorithms. The proposed approach converges to general GNEs, rather than being restricted to v-GNEs, with the attained equilibrium depending on the initialization. The effectiveness of the proposed method is demonstrated through applications in multi-robot coordination and placement. In the second part, this work includes research conducted in collaboration with Amazon scientists. One of the most challenging problems in real-world machine learning is labeled data collection, which typically requires substantial human effort and cost. Active learning aims to reduce this labeling requirement. Existing handcrafted active learning strategies, however, generally perform well only on specific types of datasets, which are often unknown in advance. In this work, I propose using contextual bandits to adaptively select the most suitable active learning strategy. The effectiveness of the proposed approach is demonstrated on publicly available external datasets. Comments: 136 pages, 14 figures Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.00759 [cs.LG] (or arXiv:2606.00759v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00759 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-125] RADE: Random Add-Drop Edge as a Regularizer ICML2026
链接: https://arxiv.org/abs/2606.00757
作者: Danial Saber,Amirali Salehi-Abari
类目: Machine Learning (cs.LG)
*备注: 27 pages, ICML 2026
Abstract:Graph Neural Networks (GNNs) suffer from overfitting and over-squashing of long-range information. Stochastic graph augmentations (e.g., edge deletion) regularize training against overfitting but can introduce train-inference misalignment and do not improve over-squashing. In contrast, rewiring methods improve connectivity to mitigate over-squashing, but are not designed to regularize training. We propose Random Add-Drop Edge (RADE), a stochastic graph augmentation method that jointly drops and adds edges to address both overfitting and over-squashing simultaneously. RADE is provably designed to align training and inference so that random augmentations regularize training without distribution shift, while supporting long-range communication at inference. We further propose and study a mini-batch gradient-norm balancing algorithm that adapts deletion and addition rates during training, rendering RADE hyperparameter-free in practice. Experiments on node- and graph-classification benchmarks show that RADE is a strong regularizer and mitigates over-squashing. Ablations support the roles of train-inference alignment, adaptive rate selection, and the complementary effects of random edge deletion and edge addition.
[LG-126] Score times Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation
链接: https://arxiv.org/abs/2606.00739
作者: Yun-Chen Cheng,Che-Yu Lin,Cheng-Lin Yang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large language models hallucinate even when the answer lies within their parameters. While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward model. We ask what can be done with only a base language model: which intrinsic signal best identifies correct outputs, and how should it be decoded? We cast this as a score~ \times ~decoder grid pairing four scores (perplexity, contrastive, power-distribution likelihood, and self-verification) with three decoding families (optimization, sampling, consensus), and evaluate every cell on MATH500 with the base and instruction-tuned Qwen3-1.7B. While self-verification, which prompts the model to judge its own answer and is sharpened by a training-free virtual-thinking prefix, works well in most settings, no score has a fixed quality: its value depends on the decoder that consumes it and on model capability. When no supervision is available, the score and the decoding family must be chosen together.
[LG-127] ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving
链接: https://arxiv.org/abs/2606.00735
作者: Seokjin Go,Marko Scrbak,Ephrem Wu,Srilatha Manne,Divya Mahajan
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance arises from the interaction of workload skew and hardware asymmetry. Token routing produces uneven and layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even balanced token assignments can leave hardware-induced stragglers unaddressed. Thus, we propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that minimizes execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling to assign high-load experts to faster devices and low-load experts to slower ones, reducing layer-level stragglers without modifying model semantics or hardware. Because both workload characteristics and effective GPU throughput can shift across serving conditions, ViBE supports lightweight recalibration under workload/performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves SLO attainment by 14%, while lowering P90 TTFT by up to 45%. We further show that the impact of hardware variability increases at scale, making variability-aware placement important for efficient, high-utilization LLM serving.
[LG-128] Graph Transfer Learning via Shared Latent Geometry: Theory and Applications
链接: https://arxiv.org/abs/2606.00716
作者: Tong Wu,Andrew Campbell,Anna Scaglione
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Inference and control in engineered physical systems pay a heavy physics cost at deployment: state estimators, inverse-problem solvers, model-predictive controllers, schedulers, and observers are often not closed-form and must re-solve a numerical optimization per instance, with the operator re-supplied each time. Physics-informed learning moves this cost to training, but uses a single encoder pathway whose latent geometry de-learns under fine-tuning and admits no quantitative transfer guarantee. We propose an asymmetric two-pathway architecture that resolves both issues. A teacher encoder consumes privileged dense states from a high-fidelity simulator and represents the system through operator-polynomial features stable under spectral perturbation; a student encoder learns the same latent geometry from sparse field data and operator descriptors. At deployment the teacher is discarded, and the frozen student runs in a single forward pass with a transfer certificate. The design connects to privileged-information learning, knowledge distillation, and cross-modal distillation, but targets cross-instance transfer rather than fixed-instance prediction: topology and operator may change, while the latent task does not. We establish sufficient and near-necessary transfer conditions via Wasserstein proximity between latent laws, yielding a zero-shot error bound, and develop a finite-sample certification protocol with active expansion when coverage is incomplete. The framework applies wherever a system admits an operator with reportable spectrum. On power-system estimation, it achieves zero-shot transfer to 100 unseen topologies, a 95% certificate pass rate, accuracy competitive with topology-aware Newton–Raphson, and sub-millisecond inference. These results suggest asymmetric pathways plus operator-anchored latent geometry provide a foundation for certified zero-shot inference and control.
[LG-129] DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal Prediction ICML2026
链接: https://arxiv.org/abs/2606.00690
作者: Enver Menadjiev,Jihyeon Seong,Jisu Yeo,Jaesik Choi
类目: Machine Learning (cs.LG)
*备注: ICML 2026 (34 pages, 12 figures, 16 tables)
Abstract:Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in real-world time series due to temporal dependencies and distributional shifts. While recent methods attempt to approximate exchangeability through reweighting, identifying optimal weights remains an open challenge. To address this limitation, we propose DistMatch, a binning-based method that recursively partitions residuals within a binary tree using the Kolmogorov-Smirnov (KS) statistic. We theoretically show that this partitioning induces approximately exchangeable leaves, thereby avoiding the need for reweighting. By applying quantile regression with online updates within each leaf, DistMatch enables locally adaptive inference and improves robustness to distributional shifts. Extensive experiments demonstrate that DistMatch outperforms existing sequential CP methods.
[LG-130] Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing
链接: https://arxiv.org/abs/2606.00686
作者: Maryam Hashemzadeh,Jerry Huang,Minseon Kim,Marc-Alexandre Côté,Sarath Chandar
类目: Machine Learning (cs.LG)
*备注:
Abstract:The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach fundamentally constricts the model’s epistemological scope, resulting in over-cautious systems that output uninformative blanket refusals to sensitive yet benign queries. In this work, we challenge the orthodoxy that unsafe data must be discarded. We propose a dialectical approach to alignment, positing that unsafe data encodes rich, domain specific knowledge critical for nuanced, safe, and informative generation. To operationalize this, we introduce SafeMoE, a Mixture-of-Experts (MoE) framework that isolates unsafe knowledge into domain-specific Low-Rank Adapters (LoRA experts) trained exclusively on harmful corpora. To synthesize safety from these unsafe primitives, we train a lightweight gating network using a minimal, highly curated set of safe-informative responses. During inference, this router dynamically orchestrates the unsafe experts, effectively steering the generation trajectory to harness their deep domain knowledge while strictly enforcing safety constraints. Extensive empirical evaluations across stringent safety benchmarks demonstrate that SafeMoE is not only safer, achieving over a 20% relative improvement in safe response rate (more than a 15% absolute gain), but also produces more informative responses when safety and harmfulness are of paramount concern. Furthermore, the routing mechanism exhibits strong zero-shot generalization to unseen domains and broader safety tasks without domain-specific supervision. Our findings suggest a paradigm shift in alignment: true safety requires not the masking of unsafe knowledge, but its controlled integration.
[LG-131] Prior-Guided Multi-Omic Transformers for Single-Cell Gene Regulatory Network Inference KDD2026
链接: https://arxiv.org/abs/2606.00685
作者: Tianyang Xu,Tianci Liu,Niraj Rayamajhi,Ryan Patrick,Kranthi Varala,Ying Li,Jing Gao
类目: Machine Learning (cs.LG)
*备注: 12 pages, 6 figures. Accepted to the KDD 2026 AI4Sciences Track
Abstract:Gene regulatory networks (GRNs) capture transcription factor-target interactions and are central to understanding cell-state regulation and disease. Reconstructing GRNs from paired single-cell transcriptomic and chromatin accessibility data is promising but challenging: scATAC is extremely sparse, and most methods rely on fixed peak-to-gene links and weak supervision. We present EpiAwareNet, a prior-guided multi-omic Transformer framework that reconstructs GRNs from paired single-cell data using only lightweight biological priors. In Stage 1, EpiAwareNet learns joint gene-peak representations with a gene-peak cross-attention module, enabling data-driven, gene-specific aggregation of accessibility signals rather than hard-coded peak-to-gene assignments. In Stage 2, EpiAwareNet incorporates a bulk-derived GRN prior as noisy positive edges to provide weak supervision under label scarcity, refining regulatory scores while remaining robust to prior noise. In our experiments, EpiAwareNet improves GRN reconstruction over representative single- and multi-omic baselines and yields GRNs with greater biological plausibility, such as improved recovery of known regulatory interactions, suggesting that lightweight biological priors from bulk data can effectively guide single-cell GRN inference when combined with adaptive cross-modal representation learning. Code and data will be available at this https URL.
[LG-132] Limits of Resolution Equivariance in Fourier Neural Operators ICLR2026
链接: https://arxiv.org/abs/2606.00677
作者: Alex Colagrande,Paul Caillon,Eva Feillet,Alexandre Allauzen
类目: Machine Learning (cs.LG)
*备注: Published as a paper at AIPDE: ICLR 2026 Workshop on AI and Partial Differential Equations. 6 pages, 2 figures
Abstract:Fourier Neural Operators are often assumed to generalize across spatial resolutions, enabling training on a coarse grid and deployment on a finer grid. We test this assumption by contrasting two inference-time choices when moving from training resolution s to test resolution Ss : running FNO directly at S , or running at s and upsampling the prediction to S via Fourier zero-padding. On Darcy flow, we observe that direct fine-grid inference is not reliably beneficial and can be worse than the low-grid-plus-upsampling baseline. We further analyze layerwise spectra and find that, under Fourier truncation, intermediate representations increasingly concentrate energy in low frequencies, with high-frequency output produced mainly by late nonlinear/decoder stages. This offers a mechanistic explanation for why FNO can perform well while retaining few modes, yet remain sensitive under resolution shifts. Our findings highlight a simple but strong baseline for cross-resolution evaluation and point to nonlinear aliasing as a key obstacle to zero-shot resolution equivariance.
[LG-133] Mapping the evolution of small reservoirs in Brazil from 1984 to 2025 using deep learning
链接: https://arxiv.org/abs/2606.00675
作者: Kylen Solvik,Luis Gustavo Carvalho,Marcia N. Macedo
类目: Machine Learning (cs.LG)
*备注: 33 pages, 5 figures, 2 tables
Abstract:Water research in Brazil largely overlooks the widespread damming of small streams for agricultural uses such as watering cattle, farm-scale hydropower, irrigation, and aquaculture. These ubiquitous dams and their reservoirs can alter water temperature, stream connectivity, aquatic habitats, greenhouse gas emissions, and evaporative water losses. Mapping small reservoirs is challenging because it requires reliably detecting small water bodies and distinguishing artificial reservoirs from natural lakes. As a result, most regional and global datasets exclude them. To address this gap, we trained a deep learning computer vision model to accurately segment small ( 1 km^2 ), stream-fed, surface water reservoirs in Brazil leveraging data from Landsat 5-9. Applying our model from 1984 to 2025, we created annual reservoir maps for the entire country to evaluate how their count, size, and distribution have changed over time. The number of detected reservoirs grew nearly fourfold from 263,913 to 996,245, while their total surface area increased from 3510 km^2 to 8550 km^2 . To our knowledge, this is the first country-wide annual dataset representing the evolution of small reservoirs over four decades. The publicly available annual maps highlight the extent and cumulative impacts of the small stream impoundments across Brazil, providing actionable insights for managing freshwater ecosystems and water resources.
[LG-134] How Neural Losses Shape VAE Latents
链接: https://arxiv.org/abs/2606.00635
作者: Giorgio Strano,Luca Cerovaz,Michele Mancusi,Tommaso Mencattini,Emanuele Rodolà
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern VAEs are rarely trained with the pointwise likelihood implied by the standard \beta -VAE objective. In practice, pointwise reconstruction is often combined with perceptual and adversarial losses, despite a lack of understanding of how this changes the latent dynamics of the model. We show that the choice of reconstruction loss reshapes the rate-distortion problem itself, altering both the information content and the geometry of the learned latent space in ways that may be invisible from reconstructions alone. First, we prove and verify empirically that augmenting pointwise reconstruction with neural terms, such as perceptual and adversarial objectives, reduces the amount of information stored in the latent representations. Second, we show that neural reconstruction losses systematically change the geometry of the latent space: they make representations more isotropic and distribute uncertainty more evenly across latent dimensions, producing different posterior variance profiles. These findings highlight how the rate-distortion tradeoff is not a comprehensive lens to understand the behavior of VAEs, and we propose a more mechanistic approach to investigate how the choice of a distortion metric reshapes the optimization problem.
[LG-135] Looped Transformers with Layer Normalization Provably Learn the Power Method
链接: https://arxiv.org/abs/2606.00605
作者: Lyumin Wu,Chenyang Zhang,Yuan Cao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 70 pages, 8 figures
Abstract:Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an “algorithmic implicit bias” of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.
[LG-136] LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models
链接: https://arxiv.org/abs/2606.00573
作者: Haiyu Wang,Yutong Wang,Leshu Li,Yihui Ren,Sai Qian Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has emerged as a promising compression technique, yet existing methods often optimize local matrix reconstruction error, rely on uniform or heuristic rank allocation, and focus mainly on attention projections while leaving feed-forward networks underexplored. In this paper, we propose~\textitLASER (\textbfLoss-\textbfAware \textbfSingular-value d\textbfEcomposition and \textbfRank allocation), a low-rank compression framework for efficient low-precision VLM inference. LASER derives a curvature-weighted SVD objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide decomposition toward downstream performance rather than reconstruction alone. We further introduce a loss-aware cross-layer rank allocation strategy based on calibration gradients, enabling more effective parameter budgeting across layers. Finally, we extend low-rank compression to FFN layers through a hybrid scheme that combines SVD with quantization. The evaluation results show that LASER achieves more than 2.3\times decoding speedup over previous work while preserving strong accuracy under low-precision inference.
[LG-137] Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction
链接: https://arxiv.org/abs/2606.00572
作者: Oluwaleke Yusuf,Adil Rasheed,Frank Lindseth
类目: Machine Learning (cs.LG)
*备注: 25 pages, 7 figures, 11 tables, including appendix. Code available at this https URL
Abstract:Passenger count data from public transit systems reveals urban mobility patterns and is essential for planning, operation, and optimisation. However, non-linear spatiotemporal interdependencies across stops and lines make modelling and prediction challenging. Existing approaches often rely on fixed temporal, spatial, or stop-level formulations, limiting their ability to capture within-trip evolution and network context. This study proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that frames trip-level transit prediction as sequence-to-sequence modelling. Given a line’s stop sequence and trip-level context, the model predicts successive boarding and alighting counts, with delay and dwell time treated as encoder-side surrogate tasks. Key components include graph embeddings for multi-relational stop similarity, a context encoder for weather and temporal information, and a multi-gate mixture-of-experts module that produces task-specific decoder representations for boarding and alighting predictions. Evaluation on public bus transit data from Trondheim, Norway, shows that SMT-GraphFormer outperforms stop-level tabular benchmarks, with ablation studies examining each component’s contribution. The sequential formulation yields substantial gains on alighting prediction ( + 0.24 in R^2 ) and consistent improvements on boarding, delay, and dwell, confirming the value of explicit trip-level sequential bias and inter-target dependencies. These findings demonstrate the potential of transformer-based sequence modelling for capturing complex spatiotemporal dynamics in public transit and underscore the value of architectures tailored to transit data rather than off-the-shelf tabular models. The proposed framework provides a horizon-agnostic basis for scenario analysis in digital twin environments, supporting informed decision-making by planners and transit operators.
[LG-138] On the Recoverability of Causal Relations from Bulk Gene Expression Data
链接: https://arxiv.org/abs/2606.00568
作者: Gongxu Luo,Boyang Sun,Kun Zhang
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:
Abstract:Bulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.
[LG-139] Semi-Supervised Noise Adaptation: Transferring Knowledge from Noise Domain ICML2026
链接: https://arxiv.org/abs/2606.00558
作者: Yuan Yao,Jin Song,Huixia Li,Tongtong Yuan,Jiaqi Wu,Yu Zhang
类目: Machine Learning (cs.LG)
*备注: Accepted by ICML 2026
Abstract:Transfer learning aims to facilitate the learning of a target domain by transferring knowledge from a source domain. The source domain typically contains semantically meaningful samples (e.g., images) to facilitate effective knowledge transfer. However, a recent study observes that the noise domain constructed from simple distributions (e.g., Gaussian distributions) can serve as a surrogate source domain in the semi-supervised setting, where only a small proportion of target samples are labeled while most remain unlabeled. Based on this surprising observation, we formulate a novel problem termed Semi-Supervised Noise Adaptation (SSNA), which aims to leverage a synthetic noise domain to improve the generalization of the target domain. To address this problem, we first establish a generalization bound characterizing the effect of the noise domain on generalization, based on which we propose a Noise Adaptation Framework (NAF). Extensive experiments demonstrate that NAF effectively leverages the noise domain to tighten the generalization bound of the target domain, leading to improved performance. The codes are available at this https URL.
[LG-140] Normalized Relevance Measure as a Unifying Framework to Explain Neural Network Latent Structures
链接: https://arxiv.org/abs/2606.00557
作者: Ping Xiong,Thomas Schnake,Grégoire Montavon,Klaus-Robert Müller,Shinichi Nakajima
类目: Machine Learning (cs.LG)
*备注:
Abstract:To understand how a neural network (NN) functions and makes predictions, it has become increasingly clear that analyzing only the input domain is insufficient – one must also examine its internal inference mechanisms to capture the complete picture. To explain the internal inference mechanisms of such models, it is essential to analyze the importance of latent representations for a given task. In this paper, we propose the \emphnormalized relevance measure (NRM) framework – a novel general explanation procedure that attributes relevance to \empharbitrary sets of neurons across layers of arbitrary architectures. In the NRM framework, relevance of selected neurons is explicitly defined as a normalized signed measure, constructed using simple operations – marginalization and conditioning based on additive and multiplicative laws – in analogy to the probability measures. The normalization property further guarantees comparability across layers. The NRM framework subsumes existing propagation-based explanation algorithms by explicitly identifying the underlying quantity being computed. We demonstrate the utility of the framework in computer vision applications, where joint relevance analysis across multiple layers reveals key information flows in VGG16 networks. Overall, the NRM framework provides a general, mathematically grounded approach to understanding how modern NNs propagate information, offering a versatile and broadly applicable foundation for explainable artificial intelligence.
[LG-141] he Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition
链接: https://arxiv.org/abs/2606.00545
作者: Asvin G
类目: Machine Learning (cs.LG)
*备注: Project out of Anthropic Fellows
Abstract:Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citepjack2026twomodes we showed they can also recognize when they are currently acting on-policy, through the sharp entropy drop of assistant-mode generation. Both signals are tied to the Assistant persona that post-training mainly shapes. This paper widens the frame to cross-persona authorship judgement on Llama-3.1-70B-Instruct. We measure a matrix of authorship claim rates over a panel of evaluator and generator personas spanning librarian to dragon to Shakespeare, and make two claims. \emphFirst, on the Assistant’s own row of the matrix, the Assistant’s claim rate, the persona-vector distance from the Assistant in activation space, and the entropy gap between the Assistant’s surprise on a persona’s text and the persona’s surprise on its own text are all tightly coupled. This extends the entropy signature of \emphacting from the companion paper to a retrospective signature of \emphhaving acted. \emphSecond, this coupling fails off the Assistant’s row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators (pirate, dragon, Shakespeare); what does is asymmetric – the evaluator’s surprise compared to the Assistant’s surprise on the same text, not to the generator’s. We rule out the alternative that any persona could play this reference role by trying many candidate substitutes; none does. We interpret the asymmetry as the model performing an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis, with the persona-vector geometry of \citetchen2025persona (every persona a delta off the Assistant) ensuring that the Assistant is the only persona universally accessible to that test. Comments: Project out of Anthropic Fellows Subjects: Machine Learning (cs.LG) Cite as: arXiv:2606.00545 [cs.LG] (or arXiv:2606.00545v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00545 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-142] Rethinking Bregman Divergences in Kronecker-Factored Optimizers
链接: https://arxiv.org/abs/2606.00542
作者: Bing Liu,Wenjie Zhou,Chengcheng Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Shampoo-style optimizers approximate gradient covariance matrices using Kronecker-factored structures. Recent work~\citelin2026understanding showed that such approximations can be viewed as projections under Bregman matrix divergences, leading to different Kronecker-factored preconditioners. However, it remains unclear what role the choice of divergence plays when the covariance is not exactly Kronecker-factored. We study this question through the spectrum of the covariance matrix. We show that Frobenius, von Neumann, and LogDet divergences distribute the unavoidable Kronecker approximation error differently across the covariance spectrum. We further show that their Kronecker factors are governed by divergence-weighted residuals rather than the raw approximation error, explaining how these spectral preferences are realized in the resulting preconditioners. Empirically, we observe that the top covariance eigenspace is substantially better aligned with the Hessian matrix, while the tail spectrum is much noisier and unreliable. Motivated by these findings, we propose a subspace-aware Kronecker optimizer that applies eigenvalue-based preconditioning in the top subspace and uses an adaptive isotropic acceleration constant in the bottom subspace.
[LG-143] GNMR: Runtime Stability Control for Low-Precision Large Language Model Training
链接: https://arxiv.org/abs/2606.00539
作者: Boao Kong,Weichen Jia,Engao Zhang,Guohong Li,Yonghan Dong,Yao Wang,Yaoyuan Wang,Yunke Peng,Kun Yuan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 29 pages, 4 figures, 15 tables
Abstract:Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit’s current gradient norm with its historical mean. Together with \Delta -GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard \mathrmmaxO budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.
[LG-144] DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
链接: https://arxiv.org/abs/2606.00535
作者: Zining Liu,Yunhai Hu,Tianhua Xia,Bo Bao,Eric Sather,Vithursan Thangarasa,Sai Qian Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs) however, its application to vision-language models (VLMs) remains relatively unexplored. We propose~\textitDREAM-S, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM-S leverages a neural architecture search (NAS) framework with target-aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models, and the most suitable draft model architecture for the underlying hardware implementation platform. DREAM-S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well-established VLMs show that DREAM-S achieves up to a 3.85\times speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: this https URL .
[LG-145] Semi-Supervised Learning with Noisy Proxy Covariates: Generalization Bounds and Distribution Regression
链接: https://arxiv.org/abs/2606.00512
作者: Kwangho Kim,Jisu Kim
类目: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)
*备注:
Abstract:In many modern machine learning pipelines, abundant pretrained representations serve as noisy proxy covariates, while task-specific labels remain scarce. We study semi-supervised regression in this setting, and propose a simple two stage estimator that learns kernel eigenfeatures from all proxy covariates and fits a ridge predictor on labeled data. We derive finite sample bounds showing that fast labeled sample rates are recovered when proxy perturbation is controlled and unlabeled proxy covariates are sufficiently abundant. We also show that distribution regression is a direct special case, with analogous guarantees when the finite bag size is large enough. Experiments show consistent gains over supervised and semi-supervised baselines, especially in low label regimes.
[LG-146] Easy robust approximate message passing for planted spike models
链接: https://arxiv.org/abs/2606.00500
作者: Misha Ivkov,Tselil Schramm
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 32 pages
Abstract:We present a simple and efficient algorithm for robust approximate message passing (AMP) in the spiked matrix setting. In particular, let \varepsilon be a sufficiently small constant, and suppose that X \in \mathbb R^n \times n is a Gaussian matrix with a planted rank- 1 spike, and E \in \mathbb R^n \times n is an adversarially chosen matrix supported on an \varepsilon n \times \varepsilon n principal minor. Let v_\mathrmAMP(X) be the output of an AMP iteration on the uncorrupted matrix X . We give a procedure that, given access only to the corrupted matrix Y = X + E , computes a vector v_\mathrmALG(Y) which is \tildeO(\sqrt\varepsilon) -close to v_\mathrmAMP(X) , for any of a class of AMP iterations which includes sparse Principal Component Analysis (PCA), non-negative PCA, and \mathbb Z_2 synchronization. Our algorithm consists of a spectral pre-processing step combined with a robust spectral initialization procedure; given these inputs, we prove that (perhaps surprisingly) AMP is robust out-of-the-box.
[LG-147] orus Graphs for Large Scale Neural Phase Analysis ICML2026
链接: https://arxiv.org/abs/2606.00496
作者: Jack Goffinet,Casey Hanks,David E. Carlson
类目: Machine Learning (cs.LG)
*备注: 23 pages, 15 figures; to be published in ICML 2026
Abstract:Oscillatory neural signals such as electroencephalography (EEG) and local field potentials (LFPs) show phase relationships that coordinate communication across brain regions. Modern recordings capture hundreds of channels across many frequency bins, yet standard phase analyses are restricted to only a few variables. The Torus Graph (TG) model, an exponential-family distribution over phases whose univariate and pairwise potentials generalize von Mises distributions, infers principled structure among oscillations but models only static, undirected dependencies and is limited to \sim ! 100 variables because its score matching inference scales as \mathcalO(d^6) . We introduce a stochastic score matching procedure that reduces the per-iteration cost to \mathcalO(d^2) , enabling inference on datasets with thousands of variables. This scalable foundation supports analyses of 1,860 frequency-phase features from multi-electrode LFPs and enables two extensions previously inaccessible to TGs or classical circular statistics: (i) a TG Hidden Markov Model capturing state-dependent phase-coupling changes (e.g., spindle-related states during sleep) and (ii) an autoregressive TG inferring directional interactions via transfer-entropy estimation. Applied to LFP recordings, these models reveal state-dependent phase-interaction patterns between wakefulness and NREM sleep. Together, they enable systematic, large-scale mapping of dynamic and directional phase relationships across brain and cognitive states.
[LG-148] ProjQ: Project-and-Quantize for Adapter-Aware LLM Compression ICML2026
链接: https://arxiv.org/abs/2606.00494
作者: Wneya Yu,Chao Zhang,Li Wang,Samson Lasaulce,Merouane Debbah
类目: Machine Learning (cs.LG)
*备注: Acceppted paper in ICML 2026
Abstract:Post-Training Quantization (PTQ) and Low-Rank Adaptation (LoRA) constitute the standard pipeline for efficient Large Language Model (LLM) deployment. However, applying them sequentially poses a problem: PTQ often leaves behind random noise that is spread out (across the model’s weights) in a way LoRA can’t easily fix, meaning that LoRA ends up wasting its limited capacity trying to fix uncorrectable noise instead of improving task performance. In this paper, we propose \textbfProjQ, a novel framework for constraining quantization noise to the low-rank manifold via orthogonal subspace projection. We derive an efficient alternating algorithm that shapes the quantization noise into a low-rank structure, effectively offloading dominant error components to the subsequent adapter while minimizing the residual error in the orthogonal “uncorrectable” subspace. Our theoretical analysis demonstrates that ProjQ preserves strictly greater model plasticity for downstream tasks compared to standard PTQ. Extensive experiments on LLaMA-2, Qwen2.5 and Qwen3 confirm that ProjQ consistently outperforms existing methods in both quantization error compensation and downstream task fine-tuning, achieving up to 2\times lower evaluation loss for compensation and matching the performance of standard 4-bit baselines on language modeling tasks with only 3 bits. The code is available on this https URL .
[LG-149] Exploiting weight-space symmetries for approximating curvature ICML2026 KR
链接: https://arxiv.org/abs/2606.00442
作者: Artem Artemev,Rui Xia,Benjamin M. Boyd,Youjing Yu,Felix Dangel,Guillaume Hennequin,Alberto Bernacchia
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: Published at ICML 2026. 35 pages, 11 figures. Code: this https URL
Abstract:Many machine learning techniques rely on approximating a loss function’s curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well known weight-space symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user-specified symmetry group directly governs the trade-off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon-like curvature estimates. We validate our method on a range of network architectures, and deploy it to second-order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.
[LG-150] EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing
链接: https://arxiv.org/abs/2606.00437
作者: Ibne Farabi Shihab,Fariya Afrin,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations change reasoning structure but preserve final answers. We argue this assumption is not well validated. Such transformations can change how PRM scores relate to correctness signals, leading to different failure modes across this http URL address this gap, we introduce \textbfEST-PRM, a stress-testing framework for dense process rewards. It applies three transformations: (1) step inflation, (2) dependency-aware step reordering, and (3) confidence markers. A vulnerability decomposition is defined that separates reward inflation from loss of correctness sensitivity. Five PRM-style models are evaluated on 4,687 reasoning chains from MATH-500, GSM8K, and this http URL results indicate clear differences in vulnerability patterns across models. Math-Shepherd shows the strongest sensitivity to position perturbations, with a Pearson correlation drop of 0.152 \pm 0.038 and a 32.8 \pm 4.9% score inflation rate. Qwen2.5-Math-PRM is most affected by step inflation, reaching a 47.6 \pm 4.3% inflation rate. Confidence-based perturbations also distort reward calibration, revealing inconsistencies in correctness estimation. Three mitigation strategies are evaluated, highlighting trade-offs between robustness coverage and false-positive rates.
[LG-151] Grounded Decoding: Retrieval-Anchored Probability Fusion for Faithful RAG
链接: https://arxiv.org/abs/2606.00432
作者: Ibne Farabi Shihab,Fariya Afrin,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:As retrieval-augmented generation (RAG) systems scale, it becomes increasingly challenging to ensure faithful grounding in external evidence. Large language models may still prioritize parametric knowledge over retrieved information when conflicts arise. We propose a novel training-free decoding framework, \emphGrounded Decoding, designed to improve factual consistency in RAG without modifying model parameters. Unlike standard approaches that rely on a single conditional distribution, our method constructs two matched-prompt distributions at every generation step: (1) a full RAG distribution conditioned on the query, retrieved documents, and generated prefix, and (2) a retrieval-only distribution conditioned solely on retrieved evidence and the same prefix. The final next-token distribution is derived as the unique solution to a KL-barycenter objective over the probability simplex, yielding a normalized geometric fusion of the two this http URL formulation naturally recovers standard RAG when the grounding weight is zero and smoothly shifts probability mass toward retrieved evidence as grounding strength increases. We further introduce a conflict-aware adaptive weighting scheme that dynamically adjusts grounding based on distributional disagreement and retriever confidence. Experiments on ALCE, Natural Questions, and FActScore demonstrate consistent improvements in factual accuracy and citation quality over standard RAG and competitive decoding-time baselines, while maintaining fluency. Our results indicate that probability-level fusion provides a strong and efficient alternative to logit-level intervention methods for faithful RAG decoding.
[LG-152] Variance-sensitive Thompson sampling for generalised linear bandits revisited
链接: https://arxiv.org/abs/2606.00431
作者: Tom Perneczky,Marc Abeille,David Janz
类目: Machine Learning (cs.LG)
*备注:
Abstract:We prove a variance-sensitive regret bound for Thompson sampling in stochastic generalised linear bandits. The argument assumes a warm-up, after which the regret is controlled through using the Gaussian Poincaré inequality. This bypasses the point at which previous optimism-based analyses break down. Removing the warm-up while retaining the same variance-sensitive scaling remains open, and appears nontrivial.
[LG-153] opology-Aware State Abstraction with Tangle Cores for Markov Decision Processes
链接: https://arxiv.org/abs/2606.00427
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:State abstraction in reinforcement learning is usually formulated as a partition of states based on reward and transition similarity. This excludes a common structural pattern in navigation, graph, and hierarchical decision problems: interface states such as doors, hubs, and bottlenecks naturally participate in more than one region. We introduce \emphtangle-core abstraction, an overlapping state-abstraction framework based on graph tangles of empirical transition graphs. The method constructs abstract states from consistently oriented low-order separations and represents shared interfaces through a membership kernel rather than a hard partition. We give value-preservation guarantees for the induced overlapping abstract MDP under an explicit action-consistency condition, identify an interior-homogeneity/boundary-leakage error decomposition, and prove a quantitative interface-overlap result showing when hard partitions incur an avoidable boundary error. Empirically, tangle-core abstractions achieve favorable compression–return tradeoffs against reward-aware, learned, topological-map, and graph-partitioning baselines across bottlenecked tabular domains, procedurally generated mazes, and MiniGrid representations. We also identify a clear failure regime in which transition topology is uninformative, where tangles predictably offer little benefit. These results position graph tangles as an effective topology-aware abstraction prior for decision problems with shared interface structure.
[LG-154] Canonicalized Stable-List Replay for Private Federated Continual Learning over Language-Model Embeddings
链接: https://arxiv.org/abs/2606.00426
作者: Ibne Farabi Shihab,Abu Sa-Adat Mohamed Moon-Im Al Ahsan,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated continual learning (FCL) lets distributed clients adapt language-model heads to evolving NLP tasks without sharing raw text. Under user-level differential privacy (DP), replay-based continual learning faces a structural obstacle: clients can release only small noisy lists of candidate replay summaries, and those lists are unordered across clients. We introduce Canonicalized Stable-List Replay (CSLR), where clients privately produce candidate replay distributions over a shared sentence-embedding space and the server aligns them using signatures induced by public anchor sentences. The anchors provide identifiability for aggregation rather than additional replay data. We prove that, under an observable anchor-signature margin, O(\log(N/\eta)/p) anchors distinguish N candidate list elements with probability at least 1-\eta , and we give a scoped anchorless non-identifiability result for unordered-label oracle models. Across five seeds on continual classification, NER, and dialogue benchmarks, CSLR improves the final average task metric by 3.9–5.6 points over the strongest non-CSLR DP baseline at \eps=4 under the reported replay-release budget, while also outperforming Hungarian and optimal-transport matchers. The formal privacy guarantee covers replay release; end-to-end private training additionally requires composition with a private optimizer for task-head updates.
[LG-155] Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity
链接: https://arxiv.org/abs/2606.00414
作者: Ibne Farabi Shihab,Sanjeda Akter,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:When many reinforcement-learning policies achieve near-optimal return, a post-hoc auditor may have to distinguish among many behaviorally distinct but return-equivalent policies. We formalize this phenomenon through an occupancy-measure analogue of Rashomon capacity: the metric entropy of the near-optimal occupancy region, computed relative to an audited deployment class. Because occupancy measures identify behavior only up to occupancy equivalence, we formulate auditing at the occupancy-class level and distinguish exact local-query oracles from noisy sample-query oracles. Our main exact-query result is conditional: if the audited class contains a 2/H -separated near-optimal packing whose local signatures are b -sparse, then exact local-query auditing requires \Omega(M/b) queries; when the packing realizes deployment-class capacity and b=O(1) , this becomes \Omega(2^\Hopt^\cF(\eps)) . We give a finite discounted hidden-branch MDP attaining this bound and show the exact Bayes success law. For noisy hidden-trigger testing, we prove a mixture lower bound of order M/\beta , where \beta is the per-sample KL signal, yielding \Omega(2^\Hopt^\cF(\eps)/(\rho^2\Delta^2)) for capacity-order packings with \beta=O(\rho^2\Delta^2) . We also provide a static target-recognition information lower bound, a transcript-compatible oracle-cover verification upper bound, and a canonical occupancy regularizer whose regularized audited capacity collapses when a trusted reference occupancy is available. Controlled benchmarks distinguish positive sparse-signature instances from high-capacity negative controls where exact auditing is easy, and map the noisy-trigger law to post-processed continuous-control and visual-RL auditing regimes.
[LG-156] Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning
链接: https://arxiv.org/abs/2606.00400
作者: Ibne Farabi Shihab,Fariya Afrin,Anuj Sharma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed replay ratios are inherently limited because the optimal mixture varies with the current domain, the training stage, and the evolving vulnerability of prior behaviors. We propose PROX-YMIX, a framework that learns a dynamic replay controller on a small proxy model and transfers the frozen controller to a larger target. The controller never observes future tasks and constructs its state from normalized validation losses and their temporal dynamics, producing a masked mixture over the current task and accessible replay buffers. Our core empirical hypothesis is forgetting mirroring: task vulnerability rankings remain largely consistent across model scales even when absolute loss magnitudes differ. We validate this assumption empirically before transferring controllers across scales. On LLaMA-3-8B across five continual instruction tuning sequences, PROXYMIX improves average accuracy by 3.4 points, reduces final forgetting by 3.5 points, and raises safety score by 5.8 points over the strongest non-oracle baseline, at roughly 50x lower policy learning cost than Oracle Target RL. The framework is leakage free and architecture independent at the interface level, and we also identify settings where the proxy assumption breaks down, highlighting limitations for robust deployment.
[LG-157] Multi-Objective Reference-Aligned Machine Unlearning
链接: https://arxiv.org/abs/2606.00399
作者: Rasa Khosrowshahli,Stephen Asobiela,Beatrice Ombuki-Berman,Shahryar Rahnamayan
类目: Machine Learning (cs.LG)
*备注: Accepted as a short paper at Canadian AI 2026. Author version with an added framework overview figure for clarity
Abstract:Machine unlearning aims to remove the influence of specific training samples while preserving the model’s utility. Existing single-objective approaches, such as gradient ascent or random relabeling, often induce catastrophic forgetting due to conflicting optimization dynamics and unbounded forgetting objectives that cause the model to drift from its pre-trained knowledge. We propose Reference-Aligned UnLearning (RAUL), a multi-objective framework that jointly optimizes forgetting and retention by replacing unbounded loss maximization with a bounded KL alignment of predictions on forgotten samples toward a reference distribution representing unseen data, instantiated either as a uniform distribution or an empirical distribution from a held-out reference set, which constrains the forgetting objective and reduces gradient conflict with retention. The resulting multi-objective optimization (MOO) problem is solved via Jacobian descent, which aggregates multiple gradients into a direction that does not conflict. Our results demonstrate that RAUL achieves the closest gap compared to full retraining.
[LG-158] Behavior Cloning of MPC for 3-DOF Robotic Manipulators ICRA2026
链接: https://arxiv.org/abs/2606.00383
作者: Theo Guegan,Dexter Wen Jie Teo
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted at the IEEE ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), 6 pages excluding references
Abstract:While Model Predictive Control (MPC) provides strong stability and robustness, it imposes a significant computational burden on real-time systems. This paper investigates the application of Behavior Cloning to approximate MPC policies for the real-time control of a 3-degree-of-freedom robotic manipulator. We present a baseline controller combining Inverse Kinematics with MPC and evaluate neural network architectures, ranging from classical regression algorithms to deep learning models including Deep MLPs and RNNs, to derive computationally efficient surrogate policies. We analyze generalization capabilities, stability considerations, and the trade-offs inherent in different architectural choices. Our empirical study employs both online and offline evaluations to assess performance regarding accuracy, computational efficiency, and fidelity to the original MPC policy. Our results demonstrate that Behavior Cloning can effectively reduce the computational burden of MPC policies for 3-DOF robotic manipulators, achieving a 3x reduction in inference latency with a 84.98% success rate under relaxed tolerances. Notably, we find that static architectures outperform temporal variants, confirming the sufficiency of instantaneous state observations for this task. However, we observe a precision gap under strict tolerances, which suggest that while Behavior Cloning captures the global optimal trajectory, further research is needed to minimize terminal steady-state error.
[LG-159] CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLM s
链接: https://arxiv.org/abs/2606.00382
作者: Kiran Nayudu,Aswini Nutakki,Sai Vinay Naidu,Ashwin Shanmugasundaram
类目: Machine Learning (cs.LG)
*备注: 38 pages, 10 figures. Patent-pending construction details deferred to companion technical report (in preparation)
Abstract:Sequential fine-tuning of large language models forces a choice: let the shared substrate keep learning and accept catastrophic forgetting, or freeze it after task one and foreclose cross-task refinement. Per-task adapter methods (LoRAHub, AdapterFusion, PackNet, Progressive Networks) take the second path. We introduce CRMA (Constrained Residual Mixing Adapter), a residual adapter whose internal mixing matrix M is doubly-stochastic at every forward pass via Sinkhorn normalization, so by Birkhoff’s theorem ||M||_2 = 1 holds by construction – a structural bound, not a penalty. CRMA’s spectrally bounded backbone provides a continuously trained shared substrate that earlier modular methods could not, while preserving their forgetting guarantees. On Mistral-7B across 5 sequential domains and 3 seeds, modular per-task LoRA on a CRMA backbone reduces loss-relative drift from +42.96% +/- 5.5 (naive sequential fine-tuning) to -0.17% +/- 0.17, with disjoint per-seed ranges, and improves prior-task holdout loss by 1.99% +/- 0.54 over a matched frozen-substrate baseline. Three independent experimental setups (Mistral-7B 4-domain controlled ablation, TinyLlama 3-domain contamination-controlled replication, Mistral-7B cross-domain probes at 7B) all show positive backward transfer – without replay buffers, without growing per-task memory, and without distillation. An inference-time ablation on Gemma-2-9B confirms CRMA mediates access to sequentially trained knowledge: 98/100 vs. 38/100 on the same weights and same questions with only CRMA injection toggled. 867 logged training steps verify ||M||_2 = 1.0 within float32 precision (max deviation 1.2 x 10^-7). The forgetting-prevention effect holds across 1.1B-9.2B parameters and four architecture families.
[LG-160] How Much Orthogonalization Does Muon Need?
链接: https://arxiv.org/abs/2606.00371
作者: Hua Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Muon optimizers improve neural-network training by replacing ill-conditioned momentum updates with approximately semi-orthogonal updates. This motivates a practical question: how much orthogonalization does Muon actually require? We study this question using a relaxed cubic Newton–Schulz schedule derived directly for Muon’s low precision singular value band. The resulting five-step cubic construction uses ten dominant matrix multiplications, compared with fifteen for five quintic Newton–Schulz iterations. The cubic schedule is not intended as a more accurate polar solver; instead, it is a principled low-cost variant that lets us probe the relation between polar accuracy, spectral shaping, and training quality. Across synthetic diagnostics, NanoGPT ablations, and training experiments on hybrid MoE/Mamba models, we find that training quality is not governed monotonically by polar-decomposition accuracy: truncated Polar Express, Muon-Jordan, cubic Newton–Schulz, and an explicit FP32 SVD polar factor can reach nearly indistinguishable final loss on GPT-2 Small, and cubic5 matches the Muon-Jordan quintic update within about 10^-3 validation loss on hybrid MoE/Mamba models with one billion to four billion parameters. These results support cubic5 as a practical low-cost Muon orthogonalization variant, with empirical evidence of training-quality parity in the settings tested.
[LG-161] Quantifying the Salience of Geo-Cultural Values for Pluralistic Safety Alignment ICML2026
链接: https://arxiv.org/abs/2606.00369
作者: Arkadiy Saakyan,Charvi Rastogi,Lora Aroyo
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 119 pages, 13 figures. ICML 2026 camera ready
Abstract:Safe global deployment of AI models requires alignment with human values that vary across cultures. Yet rater pools in safety evaluation datasets remain largely geographically homogeneous, failing to capture geo-cultural differences. Further, it remains unclear whether such differences persist after controlling for demographics such as age, gender, and ethnicity. Through a meta-analysis of safety datasets, we find that most do not report geo-cultural information, and those that do lack a unified methodology to jointly analyze geo-cultural and demographic correlates. Using the Inglehart-Welzel dimensions of cross-cultural variation, we demonstrate via multilevel modeling that cultural zone membership explains variance in safety ratings beyond standard demographics (p0.05 across 6 datasets). Moreover, our analysis indicates that roughly 10% of items in the datasets we examined are culturally sensitive: likely to be misclassified as safe without adequate cultural representation. We evaluate LLMs as both rater surrogates and triage tools, finding that current LLMs do not reliably stand in for raters, though they can help prioritize culturally sensitive items for human annotation. Our findings motivate more culturally pluralistic safety evaluation and offer practical takeaways to support it.
[LG-162] GLENS: Global Search via Learning from Solver Iterates with Diffusion Models
链接: https://arxiv.org/abs/2606.00366
作者: Anjian Li,Bartolomeo Stellato,Ryne Beeson
类目: Machine Learning (cs.LG)
*备注:
Abstract:We consider the problem of generating a large collection of initial guesses for local minima of multimodal non-convex continuous optimization problems. The goal is for these initial guesses to be high-quality (i.e., a numerical solver converges quickly) and diverse (i.e., represent many different local minima). Identifying multiple locally optimal solutions enables flexible downstream decision-making, but typically requires expensive global search. Existing data-driven methods predict initial guesses using only the final converged optima from offline solver runs, which discards information about the local neighborhoods of solutions and limits the available training data. We propose GLENS (Global Search via Learning from Solver Iterates), a data-efficient global search method that leverages intermediate solver iterates as free data augmentation. GLENS consists of two components: a neighborhood structure model that uses diffusion models to learn the local geometry around optima conditioned on problem parameters, and a solver behavior model that learns refinement directions to further guide samples towards nearby optima during diffusion sampling. Experiments on modified non-convex benchmark problems and a two-robot obstacle-avoidance navigation problem show that GLENS generates high-quality initial guesses while preserving the multimodal distribution of diverse local optima. The resulting initial guesses lead to faster solver convergence across different problem settings and solvers. We also analyze how key hyperparameter choices affect the performance.
[LG-163] Longitudinal Multimodal Sensing of Physical Activity and Well-Being in Older Adults
链接: https://arxiv.org/abs/2606.00345
作者: Flavio Di Martino,Mattia G. Campana,Marcello Magno,Lorenza Pratali,Franca Delmastro
类目: Machine Learning (cs.LG)
*备注:
Abstract:Wearable and mobile sensing technologies enable continuous monitoring of human behavior and health in real-world settings. However, predictive modeling in longitudinal multimodal data remains challenging, particularly when targeting complex or clinically derived outcomes. In this work, we present a longitudinal multimodal study of 66 older adults conducted in real-world conditions and combining wearable sensing, behavioral monitoring, and clinical assessments. This setting provides a rare opportunity to study an underrepresented population in long-term, into-the-wild conditions. Building on this dataset, we investigate how the alignment between sensed signals and target variables affects predictive performance across health-related tasks. We design a unified evaluation framework spanning tasks with increasing levels of observability, including Activity Levels prediction, Sleep Duration estimation, and Sleep Apnea Severity classification. Our results reveal a clear gradient of predictability: highly observable behavioral targets achieve robust performance (macro-F1 65%), while more abstract outcomes remain challenging despite consistent improvements over baseline models. Moreover, through explainability analysis, we show that historical features consistently emerge as the most informative predictors, highlighting the central role of longitudinal information.
[LG-164] he role of class encoding in neural collapse
链接: https://arxiv.org/abs/2606.00344
作者: Bastien Massion,Roy Makhlouf,Estelle Massart
类目: Machine Learning (cs.LG)
*备注:
Abstract:Neural collapse is a structural property of the last-hidden-layer activations in neural network classification models, when trained beyond a zero classification error. In this work, we explore the role of label encoding in neural collapse by relying on the unrestricted feature model with mean squared error training loss. We demonstrate that, for one-hot encoded labels and balanced data, the uncentered mean features associated with each class transition from a simplex equiangular tight frame to an orthogonal frame when increasing the bias regularization coefficient associated with the final classifier. These structures are reminiscent of the orthogonal frame structure of one-hot encoded labels. For any arbitrary encoding, we also show that the final classifier’s bias aims at centering the labels, compensating for the discrepancy between the global mean of the labels and the origin. We further discuss the role of the encoding in other neural collapse properties.
[LG-165] PE-means: Improved Differentially Private k-means Clustering through Private Evolution
链接: https://arxiv.org/abs/2606.00342
作者: Thomas Humphries,Zinan Lin,Sergey Yekhanin
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Databases (cs.DB)
*备注:
Abstract:We study the problem of differentially private (DP) k -means clustering in Euclidean space. Previous solutions rely on summing the private data directly, which induces a sensitivity proportional to the domain. We introduce PE-means, an extension of the private evolution (PE) algorithm (an increasingly popular method for synthetic data generation), to the problem of k -means clustering. The key advantage of PE is that it only computes a private histogram with constant sensitivity to guide the evolution. Our adaptation of PE includes new evolutionary operators for clustering, as well as other algorithmic improvements of independent interest. Overall, PE-means achieves an average improvement of 20% in clustering loss over state-of-the-art baselines.
[LG-166] Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks ICML2026
链接: https://arxiv.org/abs/2606.00340
作者: Tianyu Pang,Vignesh Kothapalli,Shenyang Deng,Haohui Wang,Dawei Zhou,Yaoqing Yang
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:We study optimal learning-rate selection in two-layer and three-layer linear neural networks trained to learn linear target functions. In particular, we derive the exact closed-form expressions for the gradients and test loss after one and two steps of gradient descent, enabling a precise characterization of early training dynamics. We characterize how learning rates should scale under the gradient approximation in the first two steps, and prove that performing updates with this approximation yields a tractable surrogate loss with a tight, small approximation error. This formulation enables the theoretical analysis of layer-wise learning rates and reveals a distinct early-training regime: test loss can be minimized by unequal learning rates at the initial step, while equal learning rates become optimal in subsequent steps. Our numerical experiments validate the theory and demonstrate the importance of balancing layer-wise learning rates early during training. The code is available at: this https URL.
[LG-167] CHAM-net: A Contrastive Hierarchical Adaptive Meta-network for Robust Global Methane Flux Prediction
链接: https://arxiv.org/abs/2606.00338
作者: Rongchao Dong,Yiming Sun,Shuo Chen,Youmi Oh,Licheng Liu,Yiqun Xie,Xiaowei Jia
类目: Machine Learning (cs.LG)
*备注:
Abstract:Methane is a potent greenhouse gas that significantly contributes to global warming. However, accurately estimating global methane emissions and consumption remains challenging due to the complex interactions among environmental drivers that may vary across spatial and temporal scales. Prior data-driven methods often overlook the inherent spatiotemporal heterogeneity of ecosystems, failing to explicitly capture site-specific characteristics and cross-year evolutionary dynamics. To address these issues, we propose the Contrastive Hierarchical Adaptive Meta-network (CHAM-net), a novel framework that explicitly learns from historical context to capture site-specific dynamics. CHAM-net employs a hierarchical encoder-decoder architecture, in which the encoder captures site-specific characteristics from historical data and then dynamically conditions the decoder to generate the final prediction. Experimental results demonstrate that CHAM-net consistently outperforms all baseline methods on both simulation and observational datasets for methane emission and consumption, achieving nRMSE values as low as 0.43 and 0.88 with corresponding R2 scores up to 0.97 and 0.68 for emission prediction.
[LG-168] Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control
链接: https://arxiv.org/abs/2606.00329
作者: David Mullett
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 29 pages, 7 figures, 2 tables; supplementary materials: 9 pages, 1 figure, 4 tables. Code, derived data packets, and Lean artifact: this https URL (release tag lean-v1.0)
Abstract:Recursive systems can enter collapse-like regimes – self-reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation – before overt failure becomes visible. We introduce Loopzero, a claim-bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence §, and declining diversity ( \delta ). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance. We evaluate the bridge on two frozen public-artifact benchmarks: a segmented public-markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens-25M offline deterministic recommender replay. Detectors are evaluated under a locked equal-false-positive contract (FP \in [0.03, 0.07], pre-registered) so all configurations face the same alert budget. Neither tested standard comparators nor Loopzero’s pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent-horizon and row-level limitations disclosed. Digitized Shumailov et al. (2024) LLM training-loop trajectories are directionally consistent with the pattern; matched-FP evaluation in that domain is deferred. The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive-collapse warning claims under an explicit alert-budget contract – non-acceptance reported as a first-class scientific outcome. Comments: 29 pages, 7 figures, 2 tables; supplementary materials: 9 pages, 1 figure, 4 tables. Code, derived data packets, and Lean artifact: this https URL (release tag lean-v1.0) Subjects: Systems and Control (eess.SY); Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2606.00329 [eess.SY] (or arXiv:2606.00329v1 [eess.SY] for this version) https://doi.org/10.48550/arXiv.2606.00329 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-169] KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question Answering
链接: https://arxiv.org/abs/2606.00328
作者: Albert Sawczyn,Piotr Bielak,Tomasz Kajdanowicz
类目: Machine Learning (cs.LG)
*备注: preprint
Abstract:Large language models (LLMs) are increasingly used for knowledge base question answering (KBQA), where answering requires selecting entities from a question-specific knowledge-graph subgraph. Yet LLMs are known to hallucinate across tasks, and KBQA is no exception: even when we provide a graph as the knowledge source, the model may rely on parametric knowledge instead of graph evidence or perform invalid reasoning over the given relations. Such hallucinated answer nodes can limit the practical deployment of KBQA systems, especially in high-stakes domains such as healthcare. We formulate hallucination detection in KBQA as an answer-node classification problem and propose a lightweight graph-based framework that treats the answering LLM as a black box. \methodname represents each KBQA instance as an augmented graph. It initializes node features with semantic representations of KG entities, marks topic entities and LLM-proposed answer nodes with learned vectors, and connect a virtual question node to the topic entities. A graph encoder then produces verification-oriented node representations, and a small MLP classifies each proposed answer node using its graph representation together with the question embedding. Experiments on WebQSP, ComplexWebQuestions, and PUGG show that our detector achieves the highest F1 on all three benchmarks ( 82.0 , 87.4 , and 84.3 ), outperforming LLM-as-judge and sampling-based baselines, while having \sim305\times fewer parameters than the reference approaches. Beyond detection, the node-level feedback is actionable: when flagged answers are fed back to the KBQA system for iterative refinement, downstream KBQA F1 improves by 13.0 – 14.5 points and Exact Match by 16.9 – 17.6 points.
[LG-170] Perturbative methods for non-parametric instrumental variable
链接: https://arxiv.org/abs/2606.00322
作者: Wei Bu,Arthur Gretton
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 8+24 pages, 4 figures, comments welcomed
Abstract:We introduce a perturbative approach for nonparametric instrumental variable (NPIV) estimation. By drawing inspiration from perturbation theory in physics, we extend standard kernel ridge methods with systematic higher perturbation order corrections that significantly improve estimation accuracy. Spectrally, the perturbation introduces mixing between different eigenmodes of the expectation integral operator, which becomes especially useful when the integral equation is ill-defined. One source for such ill-definedness can be the curse of dimensionality. Our method performs across various dimensionality regimes, particularly when the dimensionality parameter \beta which is defined through the number of samples n and dimension d as n^\beta = d , becomes large. Experimental results show that our first-order perturbative corrections can reduce prediction error by up to 99% in high-dimensional ill-defined cases ( \beta 0.7 ) compared to standard ridge regression approaches. The performance improvement is maintained across a wide range of dimensions, with the advantage becoming more pronounced as dimensionality increases.
[LG-171] Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference
链接: https://arxiv.org/abs/2606.00320
作者: Catherine Chen,Jingyan Shen,Zhun Deng,Lihua Lei
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present an online, distribution-free framework for controlling the Conditional Value-at-Risk (CVaR), extending conformal tail risk control to non-stationary and adversarial environments. Unlike classical risk control methods, which rely on stationarity or linearity of expectation, our approach provides provable safety guarantees for a nonlinear tail risk functional under arbitrary data-generating processes that may drift or shift strategically over time. By leveraging deep connections between conformal tail risk control, online learning, and the variational representation of CVaR introduced by Rockafellar and Uryasev, we develop a novel procedure for online CVaR control with adversarial regret guarantees. The proposed method operates without assumptions on the underlying data-generating process, making it broadly applicable in modern high-stakes deployment settings. We prove that the realized empirical CVaR is asymptotically controlled at the target level, and that the resulting control is asymptotically tight up to a finite-sample conservatism gap. We demonstrate the effectiveness of our approach on portfolio risk management and toxicity mitigation for Large Language Models (LLMs), where rare but catastrophic failures dominate system risk.
[LG-172] Stochastic Rounding Increases Small Singular Values
链接: https://arxiv.org/abs/2606.00312
作者: Linkai Ma,Tingzhou Yu,Petros Drineas
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Over the past half-dozen years, stochastic rounding (SR) has regained significant attention as a quantization scheme for low-precision floating-point arithmetic, with applications spanning numerical analysis and modern machine learning systems. Recent work has shown that SR acts as an implicit regularizer by increasing the smallest singular value of extremely tall-and-thin (or, symmetrically, short-and-fat) matrices. In this work, we substantially sharpen and extend this understanding in two directions. First, we show that the regularization effect of SR is not restricted to extreme aspect ratio regimes: it persists for matrices with constant aspect ratio. Second, we demonstrate that SR does not merely regularize the smallest singular value, but instead lifts entire clusters of singular values at the tail of the spectrum. Together, these results provide a more general characterization of stochastic rounding as a spectral regularizer, revealing that its effects extend beyond extremal aspect ratios and act on a broader portion of the singular value spectrum.
[LG-173] Large-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo
链接: https://arxiv.org/abs/2606.00309
作者: Xiaoyu Wang,Jonathan H. Huggins
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Stochastic gradient Langevin dynamics combined with Gibbs updates (SGLD–Gibbs) provides a highly scalable approach to approximate Bayesian inference in latent variable models. However, it remains unclear how to tune the algorithm’s hyperparameters in a principled manner to ensure the uncertainty estimates are statistically meaningful. In this work, we address this gap in tuning guidance by developing a statistical scaling limit theory for SGLD–Gibbs. We derive a joint asymptotic limit for the global parameters and latent variables under appropriate space-time rescaling. We show that global parameters converge to a diffusion-type limit, while each latent variable converges to a jump process, reflecting the use of intermittent Gibbs updates. This joint jump-diffusion structure reveals how latent-variable randomness contributes to the stationary distribution of the global parameters. We leverage our results to propose explicit guidance on hyperparameter tuning for SGLD–Gibbs that ensures meaningful uncertainty quantification. Numerical experiments show that SGLD–Gibbs with our tuning guidance leads to better parameter estimates, uncertainty quantification, and predictive performance than stochastic variational inference.
[LG-174] Modeling Spectral Energy Shifts in Spatio-Temporal Graph Anomaly Detection
链接: https://arxiv.org/abs/2606.00304
作者: Yilin Liu,Hongchao Zhang,Taylor T. Johnson,Ahmad F. Taha,Meiyi Ma
类目: Machine Learning (cs.LG)
*备注:
Abstract:Graph anomaly detection methods aim to distinguish anomalous nodes. While prior methods characterize anomalies through increased variation in the spectral energy distributions, they overlook those that result in decreased variation, i.e., camouflaged anomalies that appear normal. We show that this type of anomaly persists across multiple datasets and remains undetectable by existing spectral approaches. To address this limitation, we propose a node-level spectral energy formulation that is fully compatible with message passing and enables the detection of camouflaged anomalies. Building on this formulation, we introduce an energy-aware graph learning framework that models spectral shifts through energy-driven message passing in both static and time-series graphs. Besides, our unified architecture extends to temporal settings without introducing specialized sequence modules, enabling efficient learning under long sliding windows. Extensive experiments on large-scale benchmarks demonstrate the effectiveness and scalability of our approach.
[LG-175] FLaG: Fine-Grained Latent Grouping for Hallucination Detection
链接: https://arxiv.org/abs/2606.00301
作者: Wentao Ye,Liyao Li,Zhiqing Xiao,Muzhi Zhu,Jiaqi Hu,Zhanming Shen,Xiaomeng Hu,Sean Du,Haobo Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hallucinations in large language models (LLMs) arise from heterogeneous failure mechanisms, making reliable detection difficult for any single global uncertainty score. In this work, we formulate hallucination detection as a mechanism-aware evidence aggregation problem, where diverse representation- and token-level signals must be interpreted under multiple latent explanations. We propose FLaG, a lightweight hallucination detection framework that models correctness through a set of latent evidence groups. Each instance is softly associated with multiple groups via an energy-based routing mechanism, and group-conditional reliability signals are combined through a principled log-marginal aggregation. This design enables FLaG to capture heterogeneous hallucination patterns while remaining invariant to decision thresholds and evaluation metrics. The framework operates as a frozen-model head, requires no modification to the underlying language model, and incurs minimal computational overhead. We further provide a theoretical perspective that connects FLaG to optimal evidence aggregation under heterogeneous error mechanisms, showing that the Bayes-optimal test statistic necessarily admits a log-marginal form and that FLaG constitutes a tractable approximation with a controllable error bound. Extensive experiments across multiple benchmarks and LLM backbones demonstrate that FLaG consistently achieves SOTA performance, while exhibiting robust transfer across datasets and models, and remaining effective under limited supervision.
[LG-176] Symmetric Hermite quadrature-based balanced truncation for learning linear dynamical systems from derivative data
链接: https://arxiv.org/abs/2606.00298
作者: Sean Reiter,Steffen W. R. Werner
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Systems and Control (eess.SY); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注: 14 pages, 2 figures, 4 tables
Abstract:Data-driven reduced-order modeling is an essential component in the computer-aided design of control systems. In this work, we present a novel symmetric Hermite formulation of the quadrature-based balanced truncation algorithm that constructs linear reduced-order models from evaluations of the full-order system’s transfer function and its derivative. Significantly, the Hermite formulation preserves desirable qualitative properties of the system used to generate the data, such as state-space Hermiticity and, consequently, asymptotic stability.
[LG-177] Adaptive Order Policies for Masked Diffusion
链接: https://arxiv.org/abs/2606.00295
作者: Jama Hussein Mohamud,Mohsin Hasan,Mirco Ravanelli,Yoshua Bengio
类目: Machine Learning (cs.LG)
*备注:
Abstract:Masked diffusion models have seen great success in capturing data distributions over discrete sequences in domains such as text and proteins. These models generate data by iteratively unmasking tokens starting from a fully masked sequence, with the unmasking order typically chosen at random or using a heuristic based on denoiser probabilities. In this work, we propose a scheme for learning the unmasking order using an additional lightweight policy network on top of a diffusion model. Our proposed loss reweights terms in the masked diffusion loss according to policy probabilities, and results in a policy that prefers positions where the denoiser is more likely to be correct. We study this loss in two settings: (i) training solely the policy while using a frozen pre-trained denoiser, and (ii) training the policy and denoiser jointly with the weighted loss to allow for mutual adaptation. We demonstrate that our approach outperforms common heuristics on problems that are sensitive to token ordering, such as combinatorial tasks and proteins.
[LG-178] Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo
链接: https://arxiv.org/abs/2606.00293
作者: Yu Wang,Jie Ding,Jonathan H. Huggins
类目: Machine Learning (cs.LG); Methodology (stat.ME); Machine Learning (stat.ML)
*备注:
Abstract:Tuning algorithms such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in the practically relevant settings when the batch size is large or the model is misspecified. Existing theory that provides tuning guidance relies on continuous-time limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing new discrete-time approximations to SG(L)D with and without momentum, which enables accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time. Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the \beta -divergence rather than log-loss to obtain statistically robust inferences.
[LG-179] he Representation-Rationalizability Tradeoff in Reward Learning
链接: https://arxiv.org/abs/2606.00291
作者: Jing Dong,Yaoliang Yu,Pascal Pourpart
类目: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注:
Abstract:In RLHF, each training example contains a prompt x and two candidate responses y,y’ , and annotators provide pairwise preferences between these responses. The learning problem is to convert these heterogeneous pairwise judgments into a single scalar reward r(x,y) that measures response quality for each prompt. Classical social choice implies an impossibility because heterogeneous annotator samples can induce pooled preferences with Condorcet cycles, so no scalar reward can evaluate all compared response pairs consistently. A growing literature analyzes RLHF as a social-choice problem, but usually assumes a fixed finite set of alternatives, i.e., a pre-enumerated finite set of candidate responses for each prompt. Modern pipelines instead score responses through a learned representation \phi(x,y) before a scalar head, so \phi determines which responses are treated as distinguishable alternatives and which comparisons are visible to the reward model. Once this embedding is part of the problem, the impossibility results from social choice theory become a tradeoff. We show that the excess cross-entropy loss of any reward built on \phi decomposes exactly into a representational term, which a richer \phi shrinks, and an aggregation term, which a richer \phi enlarges by exposing more comparisons that no scalar can rank consistently. The same results extend to direct preference optimization (DPO), and jointly training the embedding and the reward cannot guarantee to recover the sweet spot of this tradeoff. Experiments on synthetic data and real preference datasets corroborate our results.
[LG-180] Inner Product Aware Quantization: Provably Fast Accurate and Adaptive Algorithms
链接: https://arxiv.org/abs/2606.00289
作者: Nathan White,Krish Singal
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
*备注:
Abstract:Quantization is a fundamental tool used to compress datasets, neural network weights, and memory usage in a range of computational tasks. Many downstream applications of vector quantization perform inner products with arbitrary inputs. This motivates the study of inner product aware quantization schemes that approximately preserve inner products with unseen vectors – in contrast to simply minimizing the mean-squared error. In this work, we formulate objectives that capture natural desiderata and develop adaptive and unbiased quantization methods that approximately preserve inner products with worst-case and average-case inputs. An analysis of these objectives shows a tight connection with the well-studied notion of Adaptive Stochastic Quantization (ASQ). We develop provably fast exact and approximate algorithms for our objectives. Our theoretical results inspire efficient practical algorithms that perform well across a variety of workload distributions. They also lead to practical algorithms for standard ASQ which are 2-10 \times faster than prior state-of-the-art methods while maintaining quality. These theoretical and empirical results contribute towards making adaptive quantization techniques more efficient and tractable in practical settings. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS) Cite as: arXiv:2606.00289 [cs.LG] (or arXiv:2606.00289v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.00289 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-181] Bit-Exact AI Inference Verification Without Performance Tradeoffs
链接: https://arxiv.org/abs/2606.00279
作者: Naci Cankaya
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:Verifying claims about AI workloads is a pre- requisite for credible AI governance of covert adversaries (who comply with monitoring only when detection likelihood is high), yet the ap- parent non-determinism of GPU floating-point arithmetic forces auditors to accept approximate output matches. Covert adversaries can exploit un- verifiable degrees of freedom in monitored compu- tation. Attack vectors include steganography, un- reported modification of inference software, and covert computation via unreported batch elements. Empirically, we analyze how modern inference engines (vLLM, HF transformers) produce deter- ministic but non-invariant outputs, without need- ing to set performance-compromising determin- ism flags, if the right information is available for re-computation and no atomic functions are called in the backend. We demonstrate that such bitwise- precise re-computation does not require access to identical hardware, via a software-only emula- tion of LLM inference across multiple NVIDIA GPU variants. Thus, accumulated rounding errors can be an auditable signature of the software and hardware setup used for inference, instead of a constraint on verifiability.
[LG-182] KISS: Keeping it Simple and Slotted when Learning to Communicate over Wireless
链接: https://arxiv.org/abs/2606.00266
作者: Kamil Szczech,Maksymilian Wojnar,Krzysztof Rusek,Katarzyna Kosek-Szott,Szymon Szott
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:A long-standing challenge in distributed wireless systems is ensuring efficient and fair random channel access. Existing solutions often address specific constraints related to timing, periodicity, or centralization, but they typically rely on fixed heuristics. Motivated by recent advances in machine learning (ML), we investigate whether ML agents can autonomously learn efficient and fair access strategies, and whether such learning can offer new insights into medium access control (MAC) design. Rather than proposing a deployable protocol, our aim is to examine whether decentralized learning can rediscover or approximate theoretically efficient random-access mechanisms under minimal assumptions. To this end, we deploy an off-policy Double Deep Q-Network (DDQN) with Bayesian inference to train agents operating over a slotted channel. The resulting method is fully online (no pre-training), fully distributed (independent multi-agent learners), stochastic (non-periodic), and requires no coordination or explicit communication. Extensive simulations show that the learned strategy adapts to varying network conditions and achieves near-theoretical efficiency while maintaining fairness. Ablation studies further reveal that the learned behavior resembles slotted ALOHA with a dynamically adjusted transmission probability, leading us to refer to the method as KISS: Keeping It Simple and Slotted.
[LG-183] Per-Group Error Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation ICRA2026
链接: https://arxiv.org/abs/2606.00253
作者: Pau Montagut Bofi,Mario García Blasco,Tessa Pulli,Markus Vincze
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 4 pages, 3 figures, 3 tables. Accepted as poster at ICRA 2026 Workshop “From Data to Decisions: VLA Pipelines for Real Robots”. Code: [ this https URL ]( this https URL )
Abstract:Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against \pi_0.5 (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of \pi_0.5 (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), \pi_0.5 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney p \leq 0.010 ), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: this https URL Comments: 4 pages, 3 figures, 3 tables. Accepted as poster at ICRA 2026 Workshop “From Data to Decisions: VLA Pipelines for Real Robots”. Code: [this https URL](this https URL) Subjects: Robotics (cs.RO); Machine Learning (cs.LG) Cite as: arXiv:2606.00253 [cs.RO] (or arXiv:2606.00253v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2606.00253 Focus to learn more arXiv-issued DOI via DataCite
[LG-184] HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads
链接: https://arxiv.org/abs/2606.00252
作者: Songyang Liu,Shunyu Yao,Dingyuan Huang,Shuai Li
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST-Humanoid Optimized with Imitation and Sample-efficient Tuning for manipulating suspended loads. HOIST first finetunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.
[LG-185] A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization
链接: https://arxiv.org/abs/2606.00230
作者: Sherin Muckatira,Namrata Shivagunde,Vijeta Deshpande,Anna Rumshisky
类目: Machine Learning (cs.LG)
*备注: 18 pages, 10 figures, 9 tables
Abstract:Grokking, the phenomenon in which neural networks generalize long after fitting their training data, has been studied in supervised settings on many epochs. LLM pre-training instead involves next-token prediction over an unlabeled corpus, with limited data repetition and no explicit train/validation split. To address this, we propose an exposure-based framework that enables the study of grokking-like dynamics during LLM pre-training. We ground our evaluation in BLiMP minimal pairs, which provide controlled grammatical contrasts. For every BLiMP minimal pair, we identify a critical phrase, the smallest continuous span that captures the grammatical contrast and the phenomenon-relevant context. Examples whose critical phrase appears in the pre-training window are assigned to the proxy-train split; the remaining examples are assigned to the proxy-validation split. Across five grammatical phenomena, we observe delayed generalization. Analyzing pre-training checkpoints before and after generalization shows that grammatical concept vectors become more predictive of grammatical acceptability and occupy a higher-dimensional subspace after generalization. We also find that attention from the critical token to the relevant context token is concentrated in a small number of heads.
[LG-186] LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching ICML2026
链接: https://arxiv.org/abs/2606.00228
作者: Yao Lai,Xuyuan Xiong,Zeyue Xue,Guojin Chen,Jing Wang,Xihui Liu,Rui Zhang,Robert Mullins,Bei Yu,Ping Luo
类目: Machine Learning (cs.LG)
*备注: ICML 2026
Abstract:In semiconductor manufacturing, lithography projects circuit layouts onto silicon wafers through an optical mask. As circuit features shrink below the wavelength of light, optical diffraction causes the printed patterns to deviate from their intended layouts. Inverse Lithography Technology (ILT) addresses this challenge by generating optimized masks that enhance the fidelity of pattern transfer onto wafers. While ILT resembles an image synthesis task, its reliance on explicit physical metrics for mask evaluation limits the applicability of existing generative models. We introduce LithoGRPO, an ILT framework that integrates the flow-matching paradigm with GRPO-based reinforcement learning (RL) fine-tuning, enabling efficient exploration of diverse masks for a given target layout. Unlike purely generative or optimization-based approaches, RL in LithoGRPO exploits the explicitly defined, physics-based reward function of ILT, enabling optimization under complex, process-aware constraints. To the best of our knowledge, this is the first framework that unifies flow matching and RL for mask optimization. To improve RL sampling efficiency, we propose a fast shot-counting algorithm for manufacturability evaluation, achieving over 130x speedup while preserving the mask ranking of the traditional shot-count metric. Extensive experiments demonstrate that LithoGRPO achieves state-of-the-art performance over both optimization-based and learning-based methods, while maintaining efficient mask generation.
[LG-187] Quantized Reasoning Models Think They Need to Think Longer but They Do Not
链接: https://arxiv.org/abs/2606.00206
作者: Sanae Lotfi,Polina Kirichenko,Steven Li,Zechun Liu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Post-training quantization (PTQ) is widely used to deploy large language models efficiently, but its effect on reasoning models is not well understood. Across math, coding, and science QA, we find that aggressive PTQ reduces accuracy while increasing chain-of-thought (CoT) length. Surprisingly, we show that in up to 52% of the quantized models’ failures, models reach the right answer in intermediate reasoning steps but do not output it as a final answer. To understand why quantization leads to this increase in overthinking errors, we measure the token-level KL divergence between quantized and full-precision output distributions. Positions with high KL divergence correlate strongly with high next-token entropy, and at these positions quantized models disproportionately sample overthinking markers such as “wait”, “but”, and “alternatively”. We show that simply introducing a training-free logit penalty on a curated set of overthinking markers can reduce CoT length by 12–23% while preserving or improving accuracy across 5 models (1.5B-32B parameters), 3 quantization methods, and 5 benchmarks, yielding a favorable Pareto frontier of accuracy against reasoning cost compared to penalizing other token sets. Overthinking errors produced by quantized models are particularly reduced by up to 58%.
[LG-188] PaintBench: Deterministic Evaluation of Precise Visual Editing
链接: https://arxiv.org/abs/2606.00188
作者: Kai Xu,Ellis Brown,Shrikar Madhu,Rob Fergus,He He,Saining Xie
类目: Graphics (cs.GR); Machine Learning (cs.LG)
*备注: Project Page: this https URL
Abstract:While current multimodal models are proficient at open-ended visual editing, executing precise single-answer edits remains an important obstacle. To probe this challenge, we introduce PaintBench, a dynamically scalable benchmark targeting 20 fundamental precise visual editing operations across four categories: geometric transformation, structural manipulation, color change, and symbolic reasoning. Procedural generation with configurable complexity enables an effectively infinite, contamination-resistant evaluation suite, and deterministic pixel-level evaluation eliminates reliance on bias-prone judge models. Across 11 image editing models, we find overall low performance, with the current highest-performing industry leader scoring only 17.1% (mIoU). Task decomposition reveals especially challenging operation types (geometric transformation, most structural manipulation, formula-based color change) and model-specific specializations. Fine-grained benchmark diagnostics further show performance degradations induced by scene variations in object count, background complexity, color scheme, and edit-region size. To test generalization of PaintBench scores to applied task performance, we create a procedural, deterministic evaluation for data visualization editing (TinyGrafixBench) and find strong linear correlation with PaintBench scores ( R^2 = 0.91 , p 0.001 ). Altogether, PaintBench provides a rigorous foundation for measuring and driving progress in precise multimodal visual editing.
[LG-189] AI-Guided Design and Optimization of Graphite-Based Anodes via Iterative Experimental Feedback
链接: https://arxiv.org/abs/2606.00187
作者: Qian Du,Mark M. Sullivan,James E. Saal,Florian Huber
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
*备注: 12 pages, 10 figures, 2 tables
Abstract:This study presents an iterative AI-guided workflow that accelerates graphite-based anode development by improving both formulation feasibility and process robustness. Sequential learning via AI/ML-guided multiobjective inverse design for anode optimization was implemented using the Citrine Platform. Starting from a noisy, incomplete dataset, the Citrine Platform was used to generate early surrogate models, which despite low predictive certainty highlighted missing process constraints. By iteratively adding feasibility labels and boundary condition failures, the workflow rapidly converged toward manufacturable, higher-performing formulations. Fabrication reliability improved from frequent process failures to 100% successful cell production, while the fraction of cells delivering \geq 350 mAh g ^-1 increased from 28.4% to 84.8%, with capacity retention rising from 42.1% to 97.3%. These results demonstrate that structured, feedback-driven AI workflows can transform imperfect industrial data into actionable guidance, enabling faster, more reproducible optimization of battery electrode manufacturing.
[LG-190] World Models: A Comprehensive Survey of Architectures Methodologies Reasoning Paradigms and Applications
链接: https://arxiv.org/abs/2606.00133
作者: Arif Hassan Zidan,Yi Pan,Hanqi Jiang,Ruiyu Yan,Wei Ruan,Zihao Wu,Lifeng Chen,Weihang You,Xinliang Li,Bowen Chen,Huawen Hu,Peilong Wang,Sizhuang Liu,Jing Zhang,Siyuan Li,Zhengliang Liu,Yu Bao,Lin Zhao,Lichao Sun,Dajiang Zhu,Xiang Li,Jinglei Lv,Quanzheng Li,Wei Liu,Tianming Liu,Wei Zhang
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
*备注:
Abstract:World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within learned representations. Despite rapid progress across reinforcement learning, robotics, autonomous driving, and video generation, the field lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey addresses that gap with a multi-axis taxonomy organized along four dimensions: (i) architecture, encompassing representation format, dynamics formulation, input modality, learning paradigm, and downstream application; (ii) methodological family, including state-space and recurrent approaches, transformer-based models, diffusion-based generators, physics-informed networks, and language-augmented multimodal systems; (iii) reasoning strategy, covering imagination-based planning, latent policy learning, counterfactual reasoning, and planning under uncertainty; and (iv) application domain, spanning robotics, autonomous driving, video prediction, multimodal agents, reinforcement learning, scientific modeling, medical imaging, educational measurement, and business and finance. Tracing the field from early cognitive-science foundations to milestone systems such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie, we examine how these dimensions interact and highlight the recent convergence of chain-of-thought reasoning with world-model imagination. We review evaluation protocols and benchmarks, identify persistent challenges such as compounding prediction errors, sim-to-real transfer, and fragmented evaluation, and outline future directions toward unified multimodal world models, foundation-scale interactive simulators, and safe deployment in safety-critical domains.
[LG-191] Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems
链接: https://arxiv.org/abs/2606.00059
作者: Julian Langschwert,Georg Schaefer,Jakob Rehrl,Stefan Huber,Simon Hirlaender
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: Accepted at DEXA AI4IP 2026
Abstract:Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.
[LG-192] Auditing Asset-Specific Preferences in Financial Large Language Models : Evidence from Bitcoin Representations and Portfolio Allocation
链接: https://arxiv.org/abs/2606.02528
作者: Wenbin Wu
类目: General Finance (q-fin.GN); Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 28 pages, 5 figures, 18 tables
Abstract:Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask three questions: do LLMs systematically prefer certain financial instruments; can an internal representation with causal leverage over those preferences be identified; and does that representation affect downstream financial decisions? We develop a three-level audit protocol and apply it to Bitcoin. First, a behavioral audit of eight frontier LLMs shows that Bitcoin’s ranking among money-like instruments is frame-dependent: models place it around rank 5 of 8 as “reliable money” but near the top under crisis and autonomous-agent frames, and an attribute-swap experiment confirms rankings track functional properties, not names. Second, we open a model’s internals: a search across thousands of sparse-autoencoder features in Gemma 3 identifies a dominant Bitcoin-selective feature. Amplifying it shifts the model toward the asset and suppressing it shifts the model away, even when “Bitcoin” never appears in the prompt. Third, we test financial consequences: amplification raises Bitcoin’s portfolio share by 5.2 percentage points while suppression lowers it by 4.6 pp, with amplification reallocating within crypto and suppression cutting total crypto exposure. We characterize this as bounded behavioral leverage (leverage meaning causal influence over outputs, not financial leverage): an identifiable internal feature can be perturbed to move financial choices, but only within measurable limits. The framework links internal representations to external recommendations, validated with random controls and mechanism boundaries. As LLMs become autonomous financial agents, this is a first step toward a behavioral layer for emerging know-your-agent (KYA) standards: knowing what an agent prefers, and how far that preference can be moved. Comments: 28 pages, 5 figures, 18 tables Subjects: General Finance (q-fin.GN); Computers and Society (cs.CY); Machine Learning (cs.LG) Cite as: arXiv:2606.02528 [q-fin.GN] (or arXiv:2606.02528v1 [q-fin.GN] for this version) https://doi.org/10.48550/arXiv.2606.02528 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-193] owards Automated Discovery: A Review of Generative Models Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design
链接: https://arxiv.org/abs/2606.02507
作者: Anand Babu,Rogério Almeida Gouvêa,Gian-Marco Rignanese
类目: Materials Science (cond-mat.mtrl-sci); Emerging Technologies (cs.ET); Machine Learning (cs.LG); Applied Physics (physics.app-ph); Computational Physics (physics.comp-ph)
*备注:
Abstract:Inverse materials design is shifting materials discovery from forward prediction to targeted proposal of candidates that satisfy objectives under physical constraints. Here, we review recent advances in generative crystal structure modeling, multimodal learning, and closed-loop design pipelines for crystalline solids. We survey how modern generators learn chemical-structural priors from large databases to enable controllable sampling of periodic structures, and compare leading model classes including variational autoencoders, normalizing flows, autoregressive formulations, and diffusion models. Particular attention is given to how feasibility constraints and physical priors are enforced across the workflow, through representation choices, training objectives, sampling-time guidance, and post-generation screening and relaxation. We also discuss how multimodal learning fuses diverse materials modalities, including crystal structures, thermodynamic, electronic information, microscopy, spectroscopy, processing context, and scientific text, to construct a more universal, transferable representation of chemical space. In addition, diverse inverse-design strategies are examined, particularly those that integrate conditional generation with latent optimization, Bayesian optimization, reinforcement learning, and active learning. Finally, we highlight recurring failure modes, such as surrogate exploitation, diversity collapse, distribution shift, and the stability-synthesizability gap, and outline discovery-grade evaluation practices based on staged reporting of validity, novelty, uniqueness, stability, and cost.
[LG-194] How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations
链接: https://arxiv.org/abs/2606.02385
作者: William Dorrell
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: 27 pages, 5 figures
Abstract:Sparse Autoencoders (SAEs) have found success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly SAEs extract, and, correspondingly, the scientific conclusions we can draw from them, are not obvious. Empirically, the proof is in the pudding: SAEs learn interpretable features. Theoretically, we lack a clear account of what properties a ‘concept’ must satisfy for an SAE to extract it. There has been extensive identifiability work studying the conditions under which sparse coding recovers ground-truth features; however, these approaches tends to focus on simple data-generating models (e.g. sparse independent features) which poorly approximate the internet-swallowing language-model representations on which SAEs are trained. Here, avoiding data-generating models, we ask simply what properties any dictionary learning optimum must satisfy. Concretely, we extend local optimality analyses (Gribonval Schnass, 2010) to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these constraints to explain a range of observed SAE behaviours - hierarchical splitting absorption, the structure of residuals, and dense antipodal features - each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Finally, we construct a novel large-dictionary convex problem and explore the wide atom-per-datapoint limit. In sum, we hope to tease model assumptions from unexpected observations, letting us learn more from SAEs’ successes and provide principles for designing their successors.
[LG-195] Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization
链接: https://arxiv.org/abs/2606.02345
作者: Louise Davy,Stephan Clémençon,Charlotte Laclau
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.
[LG-196] ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation ICML2026
链接: https://arxiv.org/abs/2606.02247
作者: David Rundel,Fabian Fumagalli,Maximilian Muschalik,Bernd Bischl,Matthias Feurer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)
Abstract:Shapley values are a principled attribution measure widely used in interpretable machine learning, but their exact computation scales exponentially with the number of players, motivating a wide range of approximation methods based on value function evaluations of sampled coalitions. This raises the question of whether approximation accuracy can be improved by adaptively selecting coalitions for evaluation based on previous evaluations. This is particularly relevant in settings where the value function is costly and the number of evaluations is severely limited, such as retraining-based feature importance, data valuation, and hyperparameter importance. For this purpose, we propose ShaplEIG, a Bayesian experimental design approach that approximates the expensive value function using a Gaussian process surrogate and adaptively selects coalitions based on their expected information gain about the Shapley values. By the linearity of the Shapley values in the value function, we show that the expected information gain is available in closed form. Furthermore, we propose an efficient computation scheme that reduces the complexity from exponential to polynomial in the number of players via elementary symmetric polynomials. In extensive experiments across diverse costly applications, our method consistently improves sample efficiency in the low-budget regime over state-of-the-art baselines.
[LG-197] Identifiable Markov Switching Models with Instantaneous Effects and Exponential Families ICML
链接: https://arxiv.org/abs/2606.02231
作者: Roel Hulsman,Carles Balsells-Rodas,Sara Magliacane
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: International Conference on Machine Learning (ICML) 2026
Abstract:Temporal systems often exhibit non-stationary behaviour, such as seasonal climate variation or glucose fluctuations in patients with type-1 diabetes. One way to model non-stationarity is through discrete latent regimes, i.e., stationary segments of time. Such systems induce a Markov Switching Model (MSM), a class of Hidden Markov Models with autoregressive dependencies among latent regimes and observed variables. Identifying latent regimes is challenging in the presence of frequent regime switches and nonlinear and non-Gaussian dynamics, particularly when there are instantaneous effects between the variables, e.g., due to slow rates of measurements. In this work, we establish the identifiability of both latent regimes and regime-dependent causal structures under temporal regime dependencies, nonlinear lagged and instantaneous effects, and independent noise from the exponential family. Our identifiability theory subsumes non-temporal mixtures of causal models. Furthermore, we introduce FlowMSM, a regime detection framework that can be paired with any stationary causal discovery method to recover regime-dependent causal structures. Experiments on synthetic benchmarks and a financial economics dataset demonstrate the effectiveness of our approach to detect latent regimes and discover causal structures from non-stationary time series.
[LG-198] ProbRes: Volatility Learning for Probabilistic Time-Series Forecasting
链接: https://arxiv.org/abs/2606.02117
作者: Tingting Wang,Yunyi Zhang,Benyou Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:Probabilistic time series forecasting has attracted increasing attention in financial applications due to the need to quantify risk and uncertainty in future observations. We propose ProbRes, a post-hoc probabilistic calibration method that explicitly learns and incorporates volatility dynamics into probabilistic forecasting, enabling effective handling of heteroskedastic data. During training, ProbRes employs two architecture-agnostic modules to separately model the conditional mean and conditional volatility. At the inference stage, it generates predictive distributions by resampling normalized residuals. ProbRes is applicable to both univariate and multivariate time series and remains robust under a wide range of error distributions, including non-Gaussian innovations with conditional heteroskedasticity. Theoretical results demonstrate ProbRes’s validity and experiments on both synthetic and real-world datasets show that ProbRes accurately captures predictive distributions and produces well-calibrated prediction intervals.
[LG-199] Error Bounds for a Diffusion Model-Based Drift Estimator
链接: https://arxiv.org/abs/2606.02115
作者: Ioar Casado-Telletxea,Omar Rivasplata
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Preprint
Abstract:Parameter estimation in stochastic differential equations is a classical statistical problem of much importance in many scientific fields. Recent work of Tapia Costa et al. (2026) introduced a novel technique for estimating the drift when the diffusion parameter is known, using discrete samples from multiple trajectories. Their method treats drift estimation as a denoising problem, and leverages tools from (conditional) score-matching diffusion models. Although their experiments showed promising results across different drift classes, the question of theoretical guarantees for their estimator was left unanswered. In this note, we address this gap by exploiting techniques from diffusion model theory. More concretely, we derive an explicit risk bound for the time-averaged mean-squared error of said drift estimator. Our bound decomposes the risk into the (i) Euler-Maruyama discretization, (ii) score/denoiser approximation, (iii) noise initialization, and (iv) sampling variance, revealing the trade-offs between the different hyperparameters and sources of error in the estimator.
[LG-200] It does what it says on the tin: safe synthetic data from coarsened margins
链接: https://arxiv.org/abs/2606.02101
作者: Gillian M Raab
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:This paper proposes a method of creating synthetic data (SD) that will have two important advantages for the user compared to other methods currently available. The first is transparency; unlike other methods, the person in receipt of the SD will know which of the relationships between variables in the original data will be approximately maintained in the SD. The second is a guarantee that the SD is derived from information that has already been judged to be free of disclosure risk. This is achieved by first defining and calculating the margins where relationships between variables will be maintained in the SD. Each margin will then be subject to statistical disclosure control (SDC) to the standards defined by the data custodian, e.g. top-coding and bottom-coding, combination of small categories and/or modifying small counts. Further adjustment of the curated margins is advised by coarsening all counts in the table to multiples of the disclosure limit. These adjusted margins are used to create SD by the Iterative Proportional Fitting (IPF) algorithm. The practical steps involved in creating such SD are illustrated using data from the 1901 Census of Scotland.
[LG-201] Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation ICML2026
链接: https://arxiv.org/abs/2606.02047
作者: Junhyoung Chung,Euijong Song,Won Hwa Kim,Gunwoong Park
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注: This paper is 41 pages long, contains 6 figures, and has been accepted to ICML 2026
Abstract:We introduce Convex Distance Operator Transport (CDOT), the first convex optimal transport framework that aligns distributions across heterogeneous domains by jointly preserving feature correspondence and intrinsic geometric structure. Specifically, CDOT employs an operator-based regularization that aligns aggregated distance structures by introducing distance and conditional expectation operators. Consequently, the proposed regularization improves the robustness to local geometric variations. We further prove that the resulting CDOT discrepancy is a valid pseudometric on the space of attributed compact metric-measure spaces. In addition, we characterize the relationship between CDOT and Gromov–Wasserstein (GW) through a new notion of dispersion gap, formally elucidating the geometric source of non-convexity in GW compared to the convexity of CDOT. In the finite-sample regime, we derive a non-asymptotic risk bound decomposed into optimization and statistical errors, establishing risk consistency under a globally convergent Frank–Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate better performance over existing methods, with stable and reliable behavior in practice.
[LG-202] Uncertainty-Aware Graph Neural Reconstruction of Urban Temperature Fields from Sparse Sensors under Deployment Constraints
链接: https://arxiv.org/abs/2606.02038
作者: Reda Snaiki,Abdelatif Merabtine
类目: Applied Physics (physics.app-ph); Machine Learning (cs.LG)
*备注:
Abstract:Reconstructing spatially continuous daily temperature fields from sparse observations is important for urban climate monitoring and heat-risk analysis, but practical deployments are limited by sensor budgets and spacing constraints. This study proposes an uncertainty-aware graph neural network (GNN) framework for reconstructing daily maximum temperature fields from sparse sensors while supporting distance-constrained sensor placement and probabilistic exceedance mapping. The model predicts both the temperature field and a spatially varying predictive uncertainty field using a graph-attention-based mean-residual architecture trained with a Gaussian negative log-likelihood. Sensor placement is addressed using a Proper Orthogonal Decomposition with QR factorization (POD-QR) strategy with a 4 km minimum inter-sensor distance constraint and is compared with random feasible placement and farthest-point sampling. The framework is evaluated over a Montreal-area polygon using Daymet v4.1 daily temperature data (1 km resolution) under a strict temporal hold-out protocol (training: 2020-2023; testing: 2024). Across sensor budgets (10-40 sensors), the proposed GNN consistently outperforms inverse distance weighting and ordinary kriging in RMSE and MAE on unobserved nodes. Sensor-placement effects are most pronounced at low budgets and diminish at higher budgets, with a practical saturation regime emerging around 30 sensors under the imposed spacing constraint. Probabilistic evaluation further shows improved uncertainty calibration with increasing sensor density and a better sharpness-calibration trade-off than kriging. These results support the proposed framework as an effective tool for uncertainty-aware temperature field reconstruction and decision-oriented heat-risk mapping.
[LG-203] Provable Data Scaling Law for Meta Learning via Complexity Minimization
链接: https://arxiv.org/abs/2606.02008
作者: Kazuto Fukuchi,Ryuichiro Hataya,Kota Matsui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Pre-training has become a fundamental paradigm in modern machine learning, with one of its key empirical benefits being reduced downstream sample complexity as the scale of pre-training data increases. However, existing theoretical frameworks for pre-training do not fully explain this phenomenon. In this paper, we introduce complexity minimization, a novel meta-representation learning framework designed to enable theoretical analysis of this scaling behavior, which learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. Our end-to-end theoretical analysis, spanning pre-training through downstream regression, shows that this framework provably captures this scaling behavior; in particular, we show that the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, we demonstrate that incorporating complexity regularization into existing meta-learning methods consistently improves downstream sample efficiency.
[LG-204] Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler ICML2026
链接: https://arxiv.org/abs/2606.01827
作者: Dimitris Oikonomou,Nicolas Loizou
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Sharpness-Aware Minimization (SAM) has established itself as a powerful and widely adopted optimizer for training machine learning models. By explicitly minimizing the sharpness of the loss landscape, SAM often improves generalization while delivering strong empirical performance. However, SAM and its variants, like most training algorithms, are sensitive to the choice of learning rate, which is typically selected through extensive hyperparameter tuning or predefined schedulers. In this work, motivated by recent advances on the effectiveness of stochastic Polyak step sizes for Stochastic Gradient Descent (SGD), we derive Polyak schedulers tailored to SAM-style updates, yielding novel adaptive algorithms in both deterministic and stochastic settings. In the smooth setting, we prove linear convergence for strongly convex objectives and an \mathcalO(1/T) convergence rate for convex objectives in the deterministic case. In the stochastic setting, we establish analogous convergence guarantees up to a neighborhood of the optimum. Numerical experiments demonstrate that the proposed Polyak schedulers achieve performance comparable to or better than carefully tuned SAM baselines, while substantially reducing the need for learning-rate tuning.
[LG-205] Site4Drug: Predicting Drug-Binding Target Sites with an AI Agent ICML2026
链接: https://arxiv.org/abs/2606.01816
作者: Taehan Kim,Sarrah Rose Mikhail Leung,Bharat Mekala,Jeongbin Park
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*备注: Accepted to the ICML 2026 Workshop on Generative and Agentic AI for Biology (GenBio)
Abstract:Selecting where to intervene on a protein (i.e., choosing a targetable site) is often a more ambiguous and failure-prone bottleneck than selecting what binds, especially for membrane proteins where accessibility, topology, and post-translational modifications (PTMs) constrain actionable regions. We present Site4Drug, a modality-aware site-finding agent that outputs a ranked list of targetable regions with explicit constraints, evidence summaries, risk flags, and a traceable decision log. Rather than requiring users to specify the drug modality upfront, Site4Drug can recommend a binding modality (e.g., antibody/peptide-like vs small-molecule) from the same evidence used for site discovery, including topology, hydropathy, PTM propensity, disulfides, domain context, and sequence. Importantly, this evidence is applied consistently across modalities, including small-molecule pocket discovery, to avoid selecting chemically plausible but biologically occluded sites. Comments: Accepted to the ICML 2026 Workshop on Generative and Agentic AI for Biology (GenBio) Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG) Cite as: arXiv:2606.01816 [q-bio.BM] (or arXiv:2606.01816v1 [q-bio.BM] for this version) https://doi.org/10.48550/arXiv.2606.01816 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-206] Accelerating Min-Max Optimization via Power-Law Stepsizes
链接: https://arxiv.org/abs/2606.01764
作者: Yue Wu,Weiqiang Zheng,Yang Cai,Haipeng Luo
类目: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*备注: 56 pages
Abstract:We revisit the convergence guarantees of the Extragradient (EG) method for unconstrained biaffine min-max optimization. It is known that EG with a fixed stepsize achieves a \Theta(T^-1/2) last-iterate convergence rate, which is slower than the optimal \mathcalO(T^-1) rate attainable by incorporating additional mechanisms such as anchoring. Motivated by recent advances showing that dynamic stepsizes alone can significantly accelerate gradient descent, we ask whether dynamic stepsizes can similarly accelerate the last-iterate convergence of EG. We present the first positive result in this direction. Specifically, we provide a deterministic dynamic stepsize schedule that accelerates the convergence rate of EG to \mathcalO(T^-2/3+\varepsilon) for any \varepsilon 0 . We also show that this rate is tight when the extrapolation and update steps of EG use the same stepsize. We then show that allowing different stepsizes for the extrapolation and update steps further improves the convergence rate to the near-optimal \mathcalO(T^-1+\varepsilon) . Our analysis reduces stepsize scheduling to an optimization problem, whose solution leads to a stepsize schedule that follows (a discretization of) a power-law distribution. Our proposed stepsize schedules and analysis extend to other methods, such as Optimistic Gradient (OG), and suggest broader applicability to general min-max optimization problems. Comments: 56 pages Subjects: Optimization and Control (math.OC); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG) MSC classes: 68Q32 (Primary) 90C47, 91A26 (Secondary) ACMclasses: I.2.6; F.2.1 Cite as: arXiv:2606.01764 [math.OC] (or arXiv:2606.01764v1 [math.OC] for this version) https://doi.org/10.48550/arXiv.2606.01764 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-207] Self-Regulating Annealing in Heavy-Tailed Diffusion Models IJCNN2026
链接: https://arxiv.org/abs/2606.01645
作者: Keito Wakatsuki,Hideaki Shimazaki
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 6 pages, 3 figures, IJCNN2026
Abstract:Diffusion models have emerged as a leading framework for deep generative modeling. While the standard Gaussian formulation is theoretically convenient, its suitability for heavy-tailed datasets remains unclear. To address this, heavy-tailed diffusion models (HTDMs) extend the standard formulation by replacing the Gaussian distribution with a Student’s t-distribution, thereby improving tail fidelity on heavy-tailed datasets. Although stochastic differential equation (SDE)-based sampling is possible in HTDMs, it has not been fully explored. In this paper, we propose an SDE-based sampler for HTDMs that explicitly incorporates a state-dependent diffusion coefficient. This state dependence naturally induces a self-regulating annealing mechanism by adaptively modulating the effective noise scale. We theoretically explore this mechanism and experimentally verify its necessity for reproducing samples from a heavy-tailed distribution.
[LG-208] Scalable Counterfactual Risk Estimation for Rare Events in Longitudinal Data KDD-2026
链接: https://arxiv.org/abs/2606.01539
作者: Xiaohui Yin,Avijit Mitra,Ying Zhou,Kun Chen,Hong Yu
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: Accepted at KDD-2026, 12 pages
Abstract:Estimating the causal effect of time-varying treatments on survival outcomes in large observational studies is computationally demanding, particularly when outcomes are rare. While g-formula-based methods such as the iterative conditional expectation (ICE) estimator provide a principled framework for longitudinal causal inference, they become computationally expensive, especially when bootstrap-based variance estimation is required. In addition, outcome rarity at each time point induces severe class imbalance, leading to instability and convergence issues in logistic regression and related models. To address these challenges, we propose a principled subsampling and reweighting strategy for longitudinal survival data that can be applied to a range of existing causal effect estimators in this setting, including the ICE estimator. The proposed method substantially reduces computational burden while preserving consistency and improving estimation stability in rare-outcome settings. We evaluate the method through simulations and validate it using a large-scale EHR cohort study on social and behavioral determinants of health (SBDH) and suicide risk, demonstrating its effectiveness for modeling rare outcomes in longitudinal data.
[LG-209] Spatially Distributed Task-Oriented Compression for Multi-Emitter Localization and Characterization with Spectral Overlap
链接: https://arxiv.org/abs/2606.01446
作者: H. Nazim Bicer,J. Nick Laneman
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 6 pages, 2 figures
Abstract:Radio frequency spectrum awareness requires the ability to detect, localize, and characterize emitters in dense and contested wireless environments. In this work, we propose a task-oriented distributed compression framework for joint multi-emitter localization and characterization using spatially distributed receivers. Each receiver observes a short window of complex IQ samples, converts the observation to a time–frequency representation, and encodes it into a compact latent vector. A central fusion decoder combines the receiver latents to estimate an unordered set of active emitters, including their locations, center-frequency offsets, occupied bandwidths, and waveform families. A permutation-invariant training objective is used to handle the arbitrary ordering of emitters and predictions. Experiments on synthetic multi-emitter scenes with spectral overlap show that even extremely compact receiver-side representations can preserve useful information for emitter counting and waveform-family estimation. However, accurate localization and spectral-parameter regression require larger latent dimensions. Increasing the receiver latent dimension from d_\mathrmrx=1 to d_\mathrmrx=16 provides the largest improvement, while further increasing to d_\mathrmrx=64 gives smaller gains. These results demonstrate the potential of learned task-oriented compression for communication-efficient distributed spectrum awareness.
[LG-210] On the Uncertainty Quantification Ability of Tabular Foundation Models
链接: https://arxiv.org/abs/2606.01427
作者: Tyler R. Johnson,Kian Ben-Jacob,Nima Negarandeh,Oriol Vendrell-Gallart,Ramin Bostanabad
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 2 tables
Abstract:Foundation models (FMs) have achieved substantial success in generalizing across tasks without problemspecific training or fine-tuning. However, many critical applications in mechanics and computational science require not only accurate predictions but also reliable uncertainty quantification (UQ). Herein we investigate the UQ capabilities of tabular FMs in regression tasks through a comprehensive empirical study comparing Tabular Prior-Data Fitted Networks (TabPFN) against Gaussian processes (GPs). We systematically evaluate these two methods across a host of regression problems with varying complexity, dataset sizes, and input dimensionalities. We use a default setting to build all the GPs and for a fair comparison against TabPFN v2.5. Our findings highlight an important trade-off between explicit and learned priors: while TabPFN achieves highly competitive performance for complex, high-dimensional problems with sufficient data, GPs often provide superior predictive accuracy and UQ in data-scarce settings. Moreover, when the chosen kernel constitutes a good prior for the underlying function, GP performance can substantially exceed that of TabPFN. Our results can be reproduced from this https URL.
[LG-211] Distribution-free changepoint localization after sequential change detection
链接: https://arxiv.org/abs/2606.01256
作者: Aytijhya Saha,Aaditya Ramdas
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注:
Abstract:This paper introduces a distribution-free framework for constructing post-detection confidence sets for changepoints after stopping a sequential change detection procedure. It is well known that conformal test martingales can be used to sequentially detect changes in distribution, but by themselves provide no inference for the time at which a proclaimed change occurred. Past work on post-detection inference requires pre- and post-change classes of distributions to be known, but this paper accomplishes localization of the changepoint without any distributional assumptions. We establish finite-sample coverage guarantees (conditional on correct detection). We provide non-asymptotic bounds on the conditional expected size of the confidence sets. Under suitable asymptotic regimes, we proved that the conditional expected size of the confidence set remains uniformly bounded. and demonstrate strong empirical performance on simulated and real data. To the best of our knowledge, this is the first general distribution-free framework for sequential changepoint localization with a valid post-detection coverage guarantee.
[LG-212] Efficient Approximation for Encoder–Decoder Neural Operators via Variation Spaces
链接: https://arxiv.org/abs/2606.01244
作者: Jia-Qi Yang,Lei Shi
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA); Numerical Analysis (math.NA); Statistics Theory (math.ST)
*备注: 14 pages
Abstract:We study operator learning using encoder–decoder neural networks. Inspired by the function-space theory of neural networks, we introduce a variation space as an infinite-dimensional structural class for nonlinear operators. This space is defined through vector-valued measures directly on the input and output spaces. For operators in this space, we establish approximation bounds for encoder–decoder two-layer networks in the Bochner L^q norm. The resulting error bound decomposes into the input encoding error, the output encoding error, and a finite-width approximation term of order N^-1/2 , with a constant independent of the input and output encoding dimensions. When the input and output encoding errors decay polynomially in the encoding dimensions, these estimates yield algebraic approximation and learning rates. The results provide an theoretical guarantees for efficient neural operator learning beyond general Lipschitz or Fréchet differentiable operator classes.
[LG-213] Context-aware child-directed speech detection from long-form recordings
链接: https://arxiv.org/abs/2606.01134
作者: Théo Charlot,Tarek Kunze,Kaveri K. Sheth,Alejandrina Cristia,Marvin Lavechin
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 6 pages, 1 figure
Abstract:Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children’s language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six-self supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.
[LG-214] Accelerating physics-informed neural networks for full waveform inversion using a hybrid quantum-classical finite-basis architecture
链接: https://arxiv.org/abs/2606.01110
作者: Hoang Anh Nguyen,Divakar Vashisth,Ali Tura
类目: Geophysics (physics.geo-ph); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 20 pages
Abstract:Full waveform inversion (FWI) reconstructs heterogeneous material properties from receiver data but remains computationally demanding. Physics-informed neural networks (PINNs) and their domain-decomposed variants (FBPINNs) offer a mesh-free alternative but face convergence challenges when representing complex velocity fields. We present a hybrid quantum-classical FBPINN for acoustic FWI, bringing together quantum computing and classical machine learning, in which the decomposed wavefield network and the global velocity network are implemented as classical-to-quantum pipelines terminating in parameterized quantum circuits (PQCs). The PQCs are realized as differentiable JAX statevector simulators, enabling end-to-end automatic differentiation through the classical PINN, the quantum circuit, and the physics-informed loss. On a geophysical anomaly benchmark, the quantum hybrid reaches a lower L1 velocity error than the primary classical FBPINN baseline in approximately 8x fewer training iterations, despite using approximately 33% fewer trainable parameters, and it outperforms all 15 classical hyperparameter variants tested. A second benchmark (checkerboard) demonstrates the generality of the inversion pipeline, confirming that the quantum hybrid architecture can recover structured spatial variations beyond the localized anomaly benchmark. Our framework is broadly applicable to wave-based inverse problems beyond geophysics, including medical ultrasound tomography and non-destructive evaluation.
[LG-215] Measuring the Symmetry–Data Exchange Rate
链接: https://arxiv.org/abs/2606.01090
作者: Ahmed M. Adly
类目: Methodology (stat.ME); Machine Learning (cs.LG)
*备注: 19 pages, 9 figures. Exploratory study. Code and data at this https URL
Abstract:Equivariance theory predicts that an architectural symmetry prior reduces sample complexity by a factor of |G|; this is widely cited but rarely measured as a scaling law with controls that separate the prior from its confounds. On a controlled C_n-symmetric task, we report three findings. First, a wrong-group control with identical orbit size and matched compute is worse than no constraint (joint pairwise CI [+0.79, +3.26] excludes zero, robust across estimators); misaligned constraint is actively harmful, not merely unhelpful. Second, an augmentation baseline equipped with test-time orbit averaging matches the equivariant model exactly – bit-identical per-epoch validation curves across matched cells – so the architecture-vs-augmentation gap is conditional on asymmetric test-time computation, not unconditional. Third, the relative exchange rate beta_diff = 1.28 is consistent in sign and order of magnitude with the theoretical 1.0 (single-level CI [+0.92, +2.05]); the more conservative two-level bootstrap (seeds x group sizes) widens this to [-0.63, +1.72], including zero, and a finer-N replication on a sqrt(2)-spaced grid is inconclusive (point estimate -0.82). The methodological contributions – the relative-rate estimator that cancels the shared-difficulty confound, the wrong-group control, and a pre-specified failure taxonomy – transfer to any inductive bias whose strength can be parameterised. Honest scoping: the primary estimator beta_diff was adopted post-hoc after the initial analysis revealed a positive-slope identifiability problem; the design was never externally pre-registered; and the headline number rests on an OLS slope over seven group sizes on a coarse N grid. This is an exploratory study, not a confirmatory measurement; the wrong-group result is the cleanest finding and the one we report with the most confidence. A registered replication on fresh seeds is future work.
[LG-216] heoretical Analysis of Engression and Reverse Markov Engression
链接: https://arxiv.org/abs/2606.01002
作者: Jiaqi Huang,Gongjun Xu,Ji Zhu
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:
Abstract:Engression is a recently proposed and effective framework for conditional distribution learning. Its multi-step Reverse Markov extension further improves generative flexibility by decomposing complex conditional sampling into sequential reverse transitions. Despite their strong empirical performance, rigorous finite-sample statistical guarantees for these methods remain unavailable. In this paper, under deep neural network parameterizations, we establish nonasymptotic convergence bounds for Engression by directly controlling the Energy Distance between the learned and target conditional distributions. For the Reverse Markov framework, we further develop an Energy-Distance-based chain rule that enables a rigorous analysis of error propagation across reverse steps. Our analysis yields corresponding excess-risk bounds that are near-optimal up to logarithmic factors relative to the classical minimax rate over a general Hölder class.
[LG-217] Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates ICML2026
链接: https://arxiv.org/abs/2606.00984
作者: Sanghoon Yu,Min-hwan Oh
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: Accepted at ICML 2026
Abstract:We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many “strictly batched” methods additionally restrict within-interval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round’s context). For linear contextual bandits, we propose two practical algorithms with only O(\log\log T) parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in T ) simultaneously in both the small- K and large- K regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step – a dominant computational bottleneck in prior strictly batched static-grid methods – yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under O(\log\log T) parameter updates that are also computationally efficient in practice.
[LG-218] Efficient Synthetic Network Generation via Latent Embedding Reconstruction
链接: https://arxiv.org/abs/2606.00934
作者: Feifan Jiang,Yinan Bu,Shihao Wu,Gongjun Xu,Ji Zhu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:
Abstract:Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at this https URL.
[LG-219] Bandit Simulation for Averag e Reward Inference
链接: https://arxiv.org/abs/2606.00913
作者: Samya Praharaj,Chih-Yu Chang,Koulik Khamaru,Kelly W. Zhang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natural question is whether one can construct a confidence interval for its mean reward and assess whether it reliably outperforms a baseline policy. The total reward achieved in any single bandit deployment is random, and deploying a bandit twice on the same population typically yields different reward trajectories due to stochastic rewards. Standard statistical inference methods cannot be used because bandit algorithms introduce complex dependencies in the collected data, which violate the i.i.d. assumption underlying many classical approaches. Moreover, existing inference methods for adaptively collected data only apply to estimands that do not depend on the data-collection algorithm (such as the mean reward under a fixed action). We propose Bandit Simulation for Inference (BSI), a framework that fits a simulator of the bandit environment from observed data–either on-policy or off-policy–and uses it to estimate the mean reward under any evaluation policy, including adaptive blackbox algorithms. BSI formally propagates uncertainty in the estimated simulator parameters into the confidence interval construction. Furthermore, for BSI to be valid, it requires only weak exploration assumptions on the behavior policy and avoids importance weighting. We prove that BSI yields asymptotically valid confidence intervals, and demonstrate empirically that it maintains nominal coverage in settings where standard off-policy evaluation methods fail.
[LG-220] ny Recursive Models for Solving the J2-Perturbed Lambert Problem
链接: https://arxiv.org/abs/2606.00895
作者: Minduli Wijayatunga,Roberto Armellin
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:This paper presents a fast, recursive neural solver for the J2-perturbed Lambert problem based on Tiny Recursive Models (TRM), termed the TRM-Perturbed Lambert (TRM-PL) model. TRM is a weight-shared architecture whose effective capacity emerges from iteration depth rather than parameter count: a compact reasoning module is applied repeatedly within a two-level latent hierarchy, refining a candidate departure velocity by simulating the J2 trajectory and correcting it from the resulting tracking error. This unifies initial-guess generation and iterative correction in a single, end-to-end differentiable architecture. The recursive refinement loop is a learned alternative to the homotopy and continuation schemes of classical perturbed-Lambert solvers: rather than following a hand-designed path from the Keplerian to the perturbed solution, the network learns its own sequence of corrections. We evaluate TRM-PL on three test cases of increasing difficulty: single-revolution low-Earth-orbit (LEO) transfers, multi-revolution LEO transfers, and multi-revolution Jovian transfers. Three training paradigms are compared: jointly learning the Lambert solution and the J2 correction; refining the Lambert initial velocity with target-position and J2-corrected velocity supervision; and refining it with target-position supervision alone. Across all cases, the refinement-only approaches are the most reliable. The position-supervised variant reduces the median terminal-position error from 21.7 km to 0.027 km on single-revolution LEO, from 340.9 km to 0.31 km on multi-revolution LEO, all with the same 2.3M-parameter architecture. A single Newton corrector iteration on the TRM-PL output tightens the Jovian median to 0.063 km, yielding compact models accurate enough for embedded deployment.
[LG-221] Statistical Analysis of using the Shapley Value for Sensor Anomaly Localization with Accurate Classifiers
链接: https://arxiv.org/abs/2606.00867
作者: Xubin Fang,Rick S. Blum
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Recent publications have suggested using the Shap- ley value for sensor anomaly/attack localization. We study the performance of such an approach by using mathematically de- fined optimum binary classifiers in the Shapley value calculation. To judge localization performance, we study the ability of the Shapley value of a given sensor observation to determine if that observation is anomalous. First, we prove that for cases with independent sensor observations, an optimized anomaly test using the Shapley value is equivalent to an optimized lower-complexity anomaly test using a single term in the Shapley value calculation, yielding the exact same probability of error. For some popular dependent observation cases involving two sensors, including correlated bivariate Gaussian/Laplacian probability density functions and constant/Gaussian at- tacks/anomalies, we prove that these two tests are fundamentally different, yielding different decision regions and error probabil- ities. Further, we prove that the Shapley value test is sometimes strictly inferior to the other (single term in Shapley calculation) test in certain statistically dependent bivariate Gaussian scenarios with large correlation magnitude and additive attacks/anomalies, while it is strictly superior in others, depending on the sign of the correlation. One can combine these two approaches to obtain a strictly better approach in these cases. These results, which provide the first theoretical statistical analysis of Shapley-based localization, seem very interesting based on the wide acceptance of the Shapley value by many researchers and should encourage further research on this topic. Numerical results are provided which illustrate our findings.
[LG-222] Benchmark Dataset for Catalysis on 2D MXenes
链接: https://arxiv.org/abs/2606.00794
作者: Pavlo Melnyk,Anmar Karmush,Mårten Wadenbäck,Ania Beatriz Rodríguez-Barrera,Johanna Rosen,Michael Felsberg,Jonas Björk
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:Merging first-principles calculations with machine learning (ML), we aim to accelerate the exploration of catalytic behaviour in novel materials. We focus on two-dimensional (2D) Ti _2 CT _y MXenes, whose versatile surface chemistry makes them particularly compelling candidates for catalysis. Resolving their composition and structure under realistic conditions exceeds the reach of standard density functional theory (DFT) due to computational cost. To address this challenge, we generate a comprehensive dataset of 50,000 DFT calculations for training and 10,000 for testing, encompassing both Ti _2 CT _y MXene configurations and molecular systems, along with an additional test dataset with 1000 genuinely new, larger systems to investigate how well models generalise. We train and validate widely used and competitive machine learning interatomic potential (MLIP) models, including EquiformerV2, MACE, MatRIS, and UPET, that accurately predict atomic forces and formation energies – quantities that DFT must repeatedly compute for structural and catalytic investigations – for these 2D materials. This combined DFT-ML framework achieves computational acceleration on the order of approximately 1-4 \cdot 10^3 (on a CPU) while maintaining desired-level accuracy (approximately +/- 10 meV/A for forces and approximately +/- 1 meV for per-atom energies), paving the way for more efficient investigations of MXene catalytic behaviour. Moreover, we perform an extensive qualitative evaluation of the trained models, showcasing the importance of comprehensive simulation-based comparison beyond benchmark metrics. The dataset and the trained models with the code are available at this https URL.
[LG-223] Statistical Testing on Directed Graphs by Surrogate Data Generation
链接: https://arxiv.org/abs/2606.00758
作者: Chun Hei Michael Chan,Alexandre Cionca,Dimitri Van De Ville
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP); Methodology (stat.ME)
*备注: Submitted to IEEE Transactions on Signal and Information Processing over Networks
Abstract:In recent years, graph signal processing has emerged as a powerful framework at the intersection of signal processing and graph theory, providing tools for the analysis of signals defined on nodes while accounting for their relationships represented by edges. These tools have been successfully applied to various settings, including statistical hypothesis testing. In particular, non-parametric approaches based on surrogate generation have been proposed for signals on undirected graphs. However, they are yet to be extended to directed graphs. In this work, we first revisit the notion of stationary graph signals on directed graphs. Specifically, and through the eigendecomposition of the graph shift operator, we define directed graph wide-sense stationary signals. Then, we propose a new framework to generate surrogate graph signals that preserve covariance structure under stationarity assumptions. Null distributions of the test metric can then be constructed from these surrogates and serve as a reference for the empirical data. Finally, we provide guiding examples and an application on real data, in which we compare the performance of our framework with existing techniques for undirected graphs or based on naive permutation, demonstrating feasibility and superiority of the proposed approach.
[LG-224] Cortex and subcortex play distinct roles over learning when cortical memory is limited
链接: https://arxiv.org/abs/2606.00667
作者: Matthew Farrell,Taro Toyoizumi
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注: Preprint. 19 pages, 4 figures
Abstract:It has been proposed that the brain integrates flexible, computationally expensive cortical processing with simpler, lower-cost subcortical mechanisms to achieve resource-efficient performance greater than that of either system alone. Despite the allure of this perspective, satisfying theoretical frameworks that explore this hypothesis are still limited. We extend existing frameworks in which a model-based module and model-free module learn in tandem by explicitly constraining the memory resources of the model-based module, and investigate the impact of this constraint in a simple decision-making setting. Memory constraints naturally give rise to strategies for allocating memory resources. We evaluate the performance of different strategies in different situations and demonstrate that when the rewarded states change often, it can be advantageous for the model-based module to focus its memory resources not on exploiting the current reward, but on capturing general structure of the environment. This work provides a theoretical foundation for a functional dissociation between cortical and subcortical systems during learning: the cortex supports general structure learning, while subcortical circuits specialize in reward-based learning. We further detail how these hypotheses can be tested on experimental data.
[LG-225] Manifold Diffusion for Structure Generation of Transition Metal Complexes
链接: https://arxiv.org/abs/2606.00666
作者: Luca Schaufelberger,Kjell Jorner
类目: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Transition metal complexes are central to catalysis, drug design, and materials science, with relevant properties strongly sensitive to their three-dimensional geometry. However, the electronic diversity and unconventional bonding environments of transition metal complexes pose a major challenge for accurate structure generation. In this work, we introduce TMCgen, a manifold diffusion machine learning model that efficiently and accurately generates geometries of transition metal complexes. By formulating the diffusion process over the metal-ligand coordination angles, combined with torsional and rotational diffusion of the ligands, TMCgen focuses on the key geometric degrees of freedom of transition metal complexes. TMCgen shows strong performance in generating accurate coordination environments on a diverse set of experimentally derived bioinorganic and organometallic complexes while requiring only few inference steps, enabling efficient generation. Our results demonstrate the potential of manifold-based generative modeling for data-efficient geometry generation, paving the way for property-conditioned design of transition metal complexes.
[LG-226] On Median of Incomplete U-Statistics
链接: https://arxiv.org/abs/2606.00661
作者: Nong Minh Hieu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We establish the finite-sample concentration rate for the Median-of-Incomplete-U-Statistics (MIU), an efficient robust estimator for the expectation of symmetric kernels.
[LG-227] aming the Loss Landscape of PINNs with Noisy Feynman-Kac Supervision: Operator Preconditioning and Non-Asymptotic Error Bounds ICML2026
链接: https://arxiv.org/abs/2606.00643
作者: Nathanael Tepakbong,Hanyu Hu,Chengyu Liu,Xiang Zhou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Statistics Theory (math.ST)
*备注: accepted in ICML 2026 (poster), 59 pages
Abstract:Physics-Informed Neural Networks (PINNs) often train slowly or fail to converge on challenging partial differential equations (PDEs), a behavior recently linked to severely ill-conditioned loss landscapes inherited from the underlying differential operator. We study PINNs augmented with a pointwise data-fidelity term, added at a few points in the domain to the standard residual and boundary losses. We show that this supervision term acts as an operator-level preconditioner: for suitable weights, our comparison bounds guarantee a substantially smaller condition number than under the standard PINN loss, independently of how the pointwise labels are obtained. For a broad class of PDEs admitting a Feynman-Kac (FK) representation, we generate such labels by Monte Carlo averages of the FK functional, resulting in what we call ``FK-PINNs", and using the excess risk decomposition approach, we derive non-asymptotic L^2(\Omega) -error bounds for FK-PINNs with \tanh activation trained by finitely many steps of gradient descent. Along the way, we establish pseudo-dimension bounds for first- and second-order derivatives of \tanh neural networks, which are of independent interest and, to the best of our knowledge, new. Numerical experiments on Poisson, Schrödinger, mean exit time, and committor problems corroborate the theory, and show that FK-PINNs can successfully solve PDEs for which standard PINNs exhibit severe failure modes.
[LG-228] Spectra-Guided Neural Tucker Factorization
链接: https://arxiv.org/abs/2606.00584
作者: Fusheng Wang,Yikai Hou
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:This paper proposes Spectra-Guided Neural Tucker Factorization (SG-NTF) for High-Dimensional and Incomplete (HDI) tensor completion. Circumventing discrete representational limits, SG-NTF maps scalar timestamps into a continuous spectral space to abstract temporal periodicities. Concurrently, a Spatio-Temporal Co-Gating (STCG) mechanism explicitly filters latent interactions via multiplicative modulation on spatiotemporal contexts. Evaluations on real-world HDI tensors verify that SG-NTF maintains competitive completion accuracy with parameter efficiency.
[LG-229] In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise
链接: https://arxiv.org/abs/2606.00520
作者: Zijian Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite p -th moment for p\in\left(1,2\right) , a setting known as the heavy-tailed noise assumption. However, some recent studies have found that Stochastic Gradient Descent ( \textsfSGD ), without any modification to its update rule, can surprisingly converge in expectation for convex problems with bounded domains, highlighting the potential of classical stochastic gradient methods. Inspired by this recent progress, we provide a comprehensive study of stochastic optimization under heavy-tailed noise and establish new in-expectation convergence results for Stochastic Mirror Descent ( \textsfSMD ) and Accelerated Stochastic Mirror Descent ( \textsfASMD ) in convex optimization, and for \textsfSGD and Stochastic Gradient Descent with Momentum ( \textsfSGDM ) in nonconvex optimization. Notably, our results not only hold without algorithmic changes but also avoid restrictive assumptions, such as bounded domains, imposed in prior work. More importantly, our analysis provides a new, elegant, and powerful framework for studying heavy-tailed stochastic optimization, opening a new route to understanding first-order stochastic gradient methods.
[LG-230] Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction
链接: https://arxiv.org/abs/2606.00483
作者: Lei Huang,Hui Shen,Kuan-Jui Su,Chuan Qiu,Martha Isabel Gonzalez-Ramirez,Anqi Liu,Zhe Luo,Yun Gong,Yipu Zhang,Dawei Li,Chaoyang Zhang,Hong-Wen Deng
类目: Genomics (q-bio.GN); Machine Learning (cs.LG)
*备注: 16 pages manuscript; 38 pages supplementary
Abstract:Genotype-based cis-expression prediction depends on accurately modeling local regulatory architecture. We present block-sparse Bayesian sparse linear mixed model (bsBSLMM), an extension of Bayesian sparse linear mixed model (BSLMM) that incorporates linkage disequilibrium (LD)-block spike-and-slab sparsity and a transcription start site (TSS)-informed SNP inclusion prior. Across 23,098 genes from GEUVADIS European-ancestry lymphoblastoid cell lines, bsBSLMM retained more predictable genes than BSLMM, LASSO, BLUP, TIGAR elastic net, and TIGAR Dirichlet-process regression under matched evaluation criteria. Compared with BSLMM, bsBSLMM improved held-out prediction performance for most shared genes, with gains driven primarily by LD-block sparsity and further enhanced by the TSS-informed prior. Variants selected by bsBSLMM showed stronger enrichment in GM12878 DNase and H3K27ac regulatory regions than variants selected by BSLMM. In transcriptome-wide association study (TWAS) analysis, bsBSLMM recovered established inflammatory bowel disease signals, including IL23R, and identified additional genome-wide significant genes not detected by BSLMM. Independent validation in the Louisiana Osteoporosis Study reproduced the increased prediction yield across ancestries and recovered biologically relevant bone mineral density pathways in downstream TWAS and gene set enrichment analyses. These results demonstrate that incorporating LD-block structure and biologically informed SNP priors improves cis-expression prediction and enhances downstream TWAS discovery.
[LG-231] Parameter-Free and Group Conditional Online Conformal Prediction
链接: https://arxiv.org/abs/2606.00419
作者: Beepul Bharti,Ambar Pal,Jacopo Teneggi,Jeremias Sulam
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Uncertainty quantification (UQ) is critical for the deployment of machine learning predictors in real-world scenarios where the data distribution may shift over time (i.e., data may not be exchangeable). Online conformal prediction (OCP) methods address this issue at the expense of either (i) group-wise error control or (ii) learning-rate independent implementation. Group-conditional coverage is essential for fairness across different collections of data points and for providing finer UQ guarantees. Parameter-free optimization is crucial for robustness to adversarial and unknown data shifts. We propose a parameter-free algorithm for group-conditional OCP and demonstrate that it achieves the best group-conditional coverage this http URL evaluate our algorithm on synthetic and real-world data, demonstrating that our method not only improves the reliability of existing parameter-free OCP methods but also provides prediction intervals that are comparable in size to well-tuned group-conditional approaches. By unifying group-conditional coverage with parameter-free online algorithms, our work lays a foundation for fair and robust uncertainty quantification in shifting environments.
[LG-232] Riemannian Stochastic Optimization for Sufficient Dimension Reduction
链接: https://arxiv.org/abs/2606.00413
作者: Thibault Pautrel,François Portier
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Sufficient dimension reduction (SDR) makes high-dimensional regression tractable by projecting the covariates onto a low-dimensional subspace that preserves the conditional mean of the response. Existing gradient-based estimators either operate in the ambient space and suffer from the curse of dimensionality, or localize in the reduced space at a per-outer-iteration cost at least quadratic in the sample size. We show that minimizers of the population Minimum Average Variance Estimation (MAVE) risk approximate the same Grassmannian target as the Outer Product of Gradients (OPG), and recast the empirical criterion as a smooth maximization on the Stiefel manifold with closed-form Riemannian gradient. The resulting algorithm, SMAVE, combines sparse projected-space nearest-neighbor localization with Riemannian stochastic gradient ascent. A simplified version comes with almost-sure convergence and a non-asymptotic rate matching the standard non-convex stochastic first-order scaling. Empirically, SMAVE matches or improves on RMAVE’s synthetic subspace recovery at moderate-to-high ambient dimension, and on four real datasets it uniformly improves over OPG and is competitive with or outperforms RMAVE at orders of magnitude lower runtime.
[LG-233] Data-Driven Spectral Prediction for Accelerating Large-Scale Electronic Structure Calculations
链接: https://arxiv.org/abs/2606.00401
作者: Abhiram Badrinarayanan,Davor Davidovic,Edoardo Di Napoli,Jurica Novak,Luigi Genovese,Gustavo Ramirez-Hidalgo,Xinzhe Wu
类目: Computational Physics (physics.comp-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Simulating large molecular systems comprising thousands of atoms requires highly scalable methodologies. While modern Density Functional Theory (DFT) codes exhibit linear scaling, solving the associated large, sparse generalized eigenproblems remains a critical computational bottleneck on exascale architectures. In the context of the LimitX project, we propose a data-driven framework to accelerate these calculations. By shifting the machine learning target from discrete eigenvalues to the coefficients of an interpolating Chebyshev polynomial, and by comparing both all-atom and fragment-based structural representations, we successfully overcome the dimensionality constraints of large-scale spectral prediction. We investigate three machine learning models (Kernel Ridge Regression, Graph Neural Networks, and Random Forests) trained on a novel 2 TB dataset of protein dimers. The predicted spectra provide initial guesses that effectively bypass early Self-Consistent Field (SCF) iterations in BigDFT. Ultimately, these spectral predictors will be deployed to dynamically optimize upcoming rational filter-based eigensolvers, such as FrASE, which is currently in initial development.
[LG-234] Cluster Analysis with Resampling for Validation and Exploration (CARVE)
链接: https://arxiv.org/abs/2606.00327
作者: Kai R. Wycik,Tiffany M. Tang,Tarek M. Zikry,Genevera I. Allen
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)
*备注:
Abstract:Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters k , producing scientific claims that are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives - grounded in the ideas of clustering stability and generalizability - have been proposed but remain scattered across specialized tools with no unified, accessible software. We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure when classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.
[LG-235] ERICA: Quantifying Replicability of Cluster Analysis
链接: https://arxiv.org/abs/2606.00302
作者: Siamak K. Sorooshyari,Manuel A. Rivas,Robert Tibshirani
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Despite being ubiquitous in science, clustering remains a technique whose results are not quantitatively scrutinized via a framework. We present an analysis called evaluating replicability via iterative clustering assignments (ERICA) that is applied to a dataset to determine whether clusters are identified in a replicable manner. The pipeline computes a statistic that describes whether structure is found in a dataset. Quantitative visualization methods are presented to answer important questions such as the similarity between clusters, and the identity of points that may be outliers. When tested on synthetic data, the findings show clusters being discovered in a replicable manner. However, we note a possibility for non-replicable results when the pipeline is applied to three gene expression datasets for breast cancer subtype validation. The study underscores the need for rigorous inspection and offers a practical tool for doing so.
[LG-236] Is Zero-Shot Super-Resolution Possible in Operator Learning?
链接: https://arxiv.org/abs/2606.00296
作者: Unique Subedi,Ambuj Tewari
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP)
*备注:
Abstract:Neural operators are often reported to exhibit zero-shot super-resolution, a phenomenon in which a model trained on coarse grids produces accurate predictions on finer testing grids without additional retraining. Despite strong empirical evidence, the theoretical foundations of this phenomenon remain unclear. In this work, we provide a systematic theoretical study of zero-shot super-resolution in operator learning. We first show that zero-shot super-resolution can be information-theoretically impossible even in benign settings such as when the input functions are available over the entire continuum and the ground truth is a simple rank-one linear operator. We then identify H" older smoothness of the output functions as a sufficient condition for zero-shot super-resolution and derive corresponding generalization bounds. Finally, we also validate the identified failure modes through experimental results.
[LG-237] Flow Matching for Convective-Scale Precipitation Downscaling
链接: https://arxiv.org/abs/2606.00281
作者: Tom Wetherell
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Generative machine learning is an increasingly important complement to dynamical downscaling for producing high-resolution precipitation projections, with diffusion models currently the leading approach. Flow matching is a related generative framework that has recently achieved strong results across image, video and other domains, and shown early promise for downscaling. We train a flow matching model to map daily precipitation from 8 km to 2 km over a convective-scale domain centred on Singapore, and benchmark it against CPMGEM, a score-based diffusion model. Flow matching achieves consistently better spatial skill: higher fractions skill score at every precipitation threshold and neighbourhood scale tested, and tighter structure and amplitude components of the SAL score with comparable location skill. However, flow matching underestimates the upper tail of the precipitation distribution, resulting in a dry bias in the climatological mean. These results suggest that flow matching is a competitive generative framework for convective-scale precipitation downscaling, particularly well suited to capturing spatial structure.
[LG-238] Out-of-Distribution generalization of quantile regression with heavy tailed inputs: an SVM approach
链接: https://arxiv.org/abs/2606.00265
作者: Baptiste Leroux,Clément Dombry,Anne Sabourin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 48 pages, 5 figures
Abstract:We study quantile regression in an extrapolation regime where the covariate takes unusually large values. Under regular variation assumptions, extreme observations can be effectively characterized through their angular components, enabling learning strategies that focus on the angle of the most extreme observations. This approach is formalized through the minimization of an asymptotic conditional risk that localizes learning in the tail of the covariate distribution. We propose a novel Support Vector Machine (SVM) framework for extreme quantile regression, leveraging reproducing kernel Hilbert spaces to handle high-dimensional and nonlinear settings. Our method also accommodates unbounded response variables and avoids restrictive transformations. We establish finite-sample learning guarantees under mild regularity assumptions. The proposed framework unifies ideas from statistical learning and multivariate extremes, providing a tractable and theoretically grounded approach to extrapolation. We complement our theoretical findings with an empirical study on river flow data from the Danube, demonstrating the practical relevance of our methods. Comments: 48 pages, 5 figures Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) MSC classes: 62G08 (Primary) 62G32 (Secondary) Cite as: arXiv:2606.00265 [stat.ML] (or arXiv:2606.00265v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2606.00265 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-239] ReFLEX: Length-Generalizable CSI Denoising for MIMO-OFDM via Relative-Frequency Bias
链接: https://arxiv.org/abs/2606.00263
作者: Zhibin Zhang,Robert Potekhin,Ziwei Wan,Vladimir Lyashev,Zhen Gao
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, submitted to IEEE journal
Abstract:This letter studies CSI denoising for MIMO–OFDM with variable NR resource block (RB) allocations. ReFLEX is a length-generalizable Transformer whose frequency attention uses a relative-frequency position bias (RFPB) generated from subcarrier offsets. A single checkpoint handles unseen RB lengths and can be applied to sparse DM-RS observations in the tested RB5/RB10 PUSCH setup without retraining. In a 3GPP~TR~38.901 UMa NLOS channel, ReFLEX achieves about -9.6 ~dB NMSE on unseen RB lengths. In NR PUSCH/UL-SCH simulations, ReFLEX denoising followed by time-frequency interpolation reduces the 10% BLER threshold by about 2–3~dB.
[LG-240] 21cmEMUv3: a hybrid diffusion-LSTM emulator of 21cmFAST summary observables
链接: https://arxiv.org/abs/2606.00219
作者: Daniela Breitman,Andrei Mesinger,Steven G. Murray,Ivan Nikolic,Roberto Trotta
类目: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Machine Learning (cs.LG)
*备注: 12 pages, 6 figures
Abstract:We are witnessing a surge in observations of the cosmic dawn (CD) and epoch of reionisation (EoR), driving an increasing demand for fast and robust theoretical interpretation frameworks. In response, machine learning (ML), and emulation in particular, has emerged as a powerful approach to accelerate and enhance inference pipelines. In this work, we present 21cmEMUv3, an emulator trained on 21cmFASTv3 simulations that model both atomically and molecularly cooling galaxies. 21cmEMUv3 is conditioned on \sigma_8 and ten astrophysical parameters to produce seven summary observables: (i) the cylindrical 21cm power spectrum (PS), emulated for the first time at such high resolution and accuracy across a wide redshift range of z \sim 6–30; (ii) the spherically-averaged 21cm PS; (iii) the mean neutral fraction of the intergalactic medium (IGM); (iv) the mean 21cm spin temperature; (v) the global 21cm signal; (vi) the ultraviolet (UV) luminosity functions (LFs); and (vii) the Thomson scattering optical depth. Notably, the cylindrical 21cm PS is emulated via score-based diffusion, while the remaining six summaries are emulated via long-short term memory (LSTM) networks, all achieving sub-percent median accuracy. We use the emulator to reinterpret current 21cm PS upper limits from HERA, for the first time using state-of-the-art hydrodynamical simulations to inform priors on star formation inside molecularly cooling galaxies. We find that our inferred soft-band X-ray luminosity per unit star formation rate is consistent with extrapolations of high-mass X-ray binaries to the low-metallicity regimes expected in the first galaxies, excluding values below 10^39.2 erg s ^-1M^-1_\odot \rmyr at 95% confidence. Finally, we produce forecasts for the detection of the cosmic 21cm PS with the Square Kilometre Array for different array configurations. The 21cmEMU package is publicly available.
[LG-241] Machine Learning-Based Bitcoin Trading Under Transaction Costs: Evidence From Walk-Forward Forecasting
链接: https://arxiv.org/abs/2606.00060
作者: Andrei Bysik,Robert Ślepaczuk
类目: Trading and Market Microstructure (q-fin.TR); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 42 pages,
Abstract:This paper investigates whether machine learning forecasts of hourly BTC-USDT returns can be converted into economically meaningful trading performance after transaction costs. Using approximately 70,000 hourly observations from 2018-2026, XGBoost, LSTM, and iTransformer are evaluated in a 27-fold walk-forward protocol. All three models produce positive gross trading performance in selected configurations, but naive sign-based strategies fail once transaction costs of ten basis points are imposed. A cost-aware execution filter, which prevents trades only when the forecast magnitude exceeds a transaction-cost-based threshold, sharply reduces turnover and restores profitability in selected configurations. The strongest long-only XGBoost strategy produces annualised returns above 65% with a Sharpe ratio above one. Additional tests show that technical indicators improve performance in selected cases, EGARCH-derived features do not provide uniformly robust gains, and XGBoost is descriptively stronger than the neural alternatives, although bootstrap evidence does not support formal statistical dominance. Loss-function and model-selection effects are secondary and statistically fragile. The results show that the main obstacle in hourly cryptocurrency trading is not only weak predictability, but also the way forecasts are converted into trades.
附件下载


