本篇博文主要内容为 2026-07-01 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR、MA六个大方向区分。
说明:每日论文数据从Arxiv.org获取,每天早上12:30左右定时自动更新。
提示: 当天未及时更新,有可能是Arxiv当日未有新的论文发布,也有可能是脚本出错。尽可能会在当天修复。
目录
概览 (2026-07-01)
今日共更新741篇论文,其中:
- 自然语言处理共89篇(Computation and Language (cs.CL))
- 人工智能共235篇(Artificial Intelligence (cs.AI))
- 计算机视觉共169篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共164篇(Machine Learning (cs.LG))
- 多智能体系统共20篇(Multiagent Systems (cs.MA))
- 信息检索共11篇(Information Retrieval (cs.IR))
- 人机交互共14篇(Human-Computer Interaction (cs.HC))
多智能体系统
[MA-0] reeAgent : A Generalizable Multi-Agent Framework for Automated Bias Labeling in Forestry via Compiled Expert Rules and Vision-Language Models
【速读】:该论文旨在解决在林业遥感等专家驱动领域中,依赖人工标注数据进行机器学习建模所面临的标注成本高、一致性差以及可扩展性不足的问题,尤其针对树干倾斜度分类这类任务中的标注瓶颈。其核心解决方案是提出一种多智能体系统(Multi-Agent System, MAS),通过将专家构建的决策树作为结构先验,结合视觉-语言模型(Vision-Language Models, VLMs)在每个节点上执行局部语义感知,并利用多智能体投票机制降低VLM固有的随机性影响。进一步地,作者提出了解耦式声明性决策(Decoupled Declarative Decision, D3)框架,实现了对不同专家定义决策结构的零修改泛化能力。实验结果表明,该框架在树倾斜度分类任务上优于传统监督学习基线方法,同时显著减少了对专家标注数据的需求,验证了在保持可解释性的前提下,通过智能体协同调度VLM与专家先验,能够以更低的成本复现专家标注流程的有效性。
链接: https://arxiv.org/abs/2606.31976
作者: Shiyi Chen,Nicholas Saban,Collin Hargreaves,Huiqi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 2 figures
Abstract:Human-labeled data are widely used as reference annotations in ML, despite known variability across annotators in many expert-driven domains. In addition, expert annotation is slow, inconsistent, and remains a major bottleneck for scaling tasks like tree height bias classification in forestry remote sensing. We propose a multi-agent system (MAS) that orchestrates expert decision trees with Vision-Language Models (VLMs), treating the decision tree as a structural prior while VLMs perform localized semantic perception at individual nodes, with multi-agent voting to mitigate VLM stochasticity. We formalize a Decoupled Declarative Decision (D3) Framework that enables zero-modification generalization across diverse expert-defined decision structures. On a tree bias classification testbed, our framework outperforms supervised ML baselines and reduces the amount of expert labeling effort required. These results suggest that agentic orchestration of VLMs with expert priors can reproduce expert-defined labeling procedures at substantially lower annotation cost while maintaining interpretability.
[MA-1] MECoBench: A Systematic Study of Multimodal Agent Collaboration in Embodied Environments
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉具身环境中的协作能力不足的问题,尤其关注其作为具身智能体在真实世界任务中协同工作的潜力与局限。其核心解决方案是提出MECoBench——一个涵盖多样化现实任务、两种协作结构及三种协作模式的多模态具身协作基准测试平台。该平台的关键在于系统性地评估不同协作策略在复杂环境下的有效性,揭示协作收益与协调复杂度之间的权衡关系,并验证通信机制对协作性能的决定性作用。研究发现,协作显著提升任务完成率与环境鲁棒性,但最优协作模式受团队规模和模型能力影响,且在噪声先验和探索性条件下表现出更强的稳定性。MECoBench为理解多模态具身协作的机制与边界提供了可复现的实验框架。
链接: https://arxiv.org/abs/2606.31966
作者: Qingyun Liu,Jiwen Zhang,Jingyi Hu,Siyuan Wang,Zhongyu Wei
机构: Fudan University(复旦大学); Shanghai Innovation Institute(上海创新研究院); The Chinese University of Hong Kong(香港中文大学)
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
Abstract:Recent multimodal large language models (MLLMs) have strong potential as embodied agents, but their ability to collaborate in visually grounded environments remains underexplored. To address this gap, we introduce MECoBench, a multimodal embodied cooperation benchmark with an evaluation platform spanning diverse real-world tasks, two cooperation structures, and three collaboration modes. Through extensive experiments across various MLLMs, we summarize three key findings: (i) Collaboration generally improves embodied task completion, but its benefits depend on balancing collaborative gains against coordination complexity. (ii) Communication is essential to collaboration gains, while the best collaboration mode depends on team size and model capability. (iii) Moreover, collaboration improves robustness under noisy priors and exploration conditions. Generally, MECoBench provides a systematic testbed for understanding the mechanisms and limits of multimodal embodied collaboration. Code and dataset are available at this https URL.
[MA-2] Analytic Cut in Epistemic Logics with Distributed Knowledge
【速读】:该论文旨在解决多主体认知逻辑中分布式知识(distributed knowledge)的证明论问题,具体针对基于K45、KD45和S5框架的分布式知识时态逻辑的序贯演算(sequent calculus)构造与性质分析。传统序贯演算在这些系统中均不满足切割消去(cut elimination)性质,导致其证明论强度受限。本文的关键解决方案在于采用Takano(2018)提出的策略,通过引入分析性切割(analytic cut) 机制,将切割公式限制为结论公式的子公式集合,从而在所有三类系统中成功建立分析性切割性质。这一改进不仅恢复了良好的证明论行为,还作为直接推论保证了Craig插值定理在所有考虑的逻辑中的成立。此外,研究进一步表明,当允许空组合作为分布式知识算子的解释对象时(此时分布式知识退化为全局模态),上述所有证明论结果依然保持有效。
链接: https://arxiv.org/abs/2606.31886
作者: Ryo Murai(Independent Researcher),Sizhuo Liu(Hokkaido University),Katsuhiko Sano(Hokkaido University)
机构: 未知
类目: Multiagent Systems (cs.MA); Logic in Computer Science (cs.LO)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:Distributed knowledge is a notion of group knowledge studied in multi-agent epistemic logic. Semantically, the distributed knowledge of a group is interpreted via an accessibility relation given by the intersection of the epistemic accessibility relations of the agents in that group. This paper investigates sequent calculi for epistemic logics of distributed knowledge based on K45, KD45, and S5. While cut elimination holds in existing sequent calculi for modal logics K45 and KD45, it fails in all the systems mentioned above. Instead, we establish the analytic cut property for all three systems by adapting Takano’ s (2018) strategy, which restricts the cut formulas to the set of subformulas of the conclusion of the cut rule. As a corollary, the Craig interpolation theorem holds for all logics considered. We also show that all proof-theoretic results remain valid when the empty group is allowed for the distributed-knowledge operator, in which case the distributed knowledge for the empty group is interpreted as the global modality.
[MA-3] Inquisitive Action Logic
【速读】:该论文旨在解决多智能体系统中关于行动推理的语义局限性问题,即传统模态逻辑仅关注代理能否强制实现某种结果属性,而忽略了代理通过其行动所决定的结果中的具体方面。为此,论文提出了一种名为“好奇行动逻辑”(InqAL)的新型多智能体模态逻辑框架,其核心创新在于将代理对结果的确定性(agentive determination)建模为涉及疑问的模态命题,从而能够形式化地表达代理在行动中“决定”了哪些结果特征。其解决方案的关键在于:基于并发博弈结构(concurrent game structures),将传统的邻域逻辑扩展为一种包含疑问语义的多智能体逻辑,使得逻辑表达能力可精确刻画代理的实际效用函数(actual effectivity functions),并建立了从多智能体邻域框架到并发博弈结构的充要条件表示定理。此外,论文给出了InqAL的公理化系统,并通过有限模型性质证明了其完备性与可判定性,为多智能体系统中行动与信息交互的逻辑分析提供了坚实的形式基础。
链接: https://arxiv.org/abs/2606.31866
作者: Ivano Ciardelli(University of Padua)
机构: 未知
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:We introduce inquisitive action logic, InqAL, a multi-agent modal logic for reasoning about action. While traditional approaches focus on what properties of the outcome an agent can force, InqAL also captures what aspects of the outcome an agent determines through their actions. As we argue, such claims of agentive determination are naturally analyzed as modal claims involving questions. Technically, InqAL is a multi-agent extension of inquisitive neighborhood logic based on concurrent game structures. With respect to statements, it is expressively equivalent to the individual-agent fragment of the socially friendly coalition logic recently proposed by Goranko and Enqvist. We present an axiomatization of InqAL and prove completeness and decidability via the finite model property. Along the way, we establish a representation theorem for actual effectivity functions, associating to an agent the sets of outcomes corresponding to their possible actions; we give exact conditions under which a multi-agent neighborhood frame arises from a concurrent game structure. Comments: In Proceedings AiML 2026, arXiv:2606.29444 Subjects: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA) Cite as: arXiv:2606.31866 [cs.LO] (or arXiv:2606.31866v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2606.31866 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: EPTCS 447, 2026, pp. 222-241 Related DOI: https://doi.org/10.4204/EPTCS.447.13 Focus to learn more DOI(s) linking to related resources
[MA-4] he Logic of Data Access and Data Exchanges
【速读】:该论文旨在解决传统动态知识逻辑(Dynamic Epistemic Logic, DEL)在建模非命题型知识(non-propositional knowledge)及其动态更新方面的局限性,特别是针对个体或群体对变量取值的条件性知识(conditional knowledge of a number)以及其可被缩小至有限可能性范围的能力。现有框架难以精确表达“某主体在给定条件下可将变量x的可能取值缩减至至多N个”这一语义,也无法对这些可能值进行命名与比较。为此,论文提出一种扩展的逻辑系统:在标准的命题知识模态基础上,引入用于表达条件性非命题知识的算子,并进一步推广为能够刻画知识压缩能力的算子(即最多保留N个可能值)。关键创新在于通过基于极小化算子的确定性描述(definite descriptions),实现对可能值中最小者(按固定序)的命名与推理。在此静态逻辑基础上,论文构建了类DEL的动态扩展,支持对“数据交换事件”的统一建模,涵盖私密/公开命题告知、秘密数据库劫持及开源数据共享等复杂场景,其中信息以“数据块”形式整体转移。研究给出了所构造逻辑系统的完备公理化体系,并证明了其可判定性与共表达性(co-expressivity),从而为复杂情境下的知识动态演化提供了形式化分析工具。
链接: https://arxiv.org/abs/2606.31858
作者: Alexandru Baltag(ILLC, University of Amsterdam),Sonja Smets(ILLC, University of Amsterdam)
机构: 未知
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:We investigate a new logic that extends Dynamic Epistemic Logic (DEL), by combining standard epistemic modalities for (individual and distributed) propositional knowledge with operators for (conditional) non-propositional knowledge of a number (in which an agent or a group have knowledge of the value of some variable x, conditional on some additional information). We also generalize these operators, by considering formulas that express the fact that an agent or group can (conditionally) narrow down the possible values of the variable x to at most N possibilities (for some natural number N). In order to name and compare such hypothetical values, we extend the logic further with definite descriptions based on minimization operators, denoting the least of the N possible values of x (according to some fixed order) that are considered possible by the agent or group. On this static base, we consider DEL-style extensions with dynamic modalities for general ‘data-exchange events’ (covering private and public propositional announcements, but also secret hacking of a private database, or public sharing of one’s data via open-source repositories, etc.). In such scenarios, whole ‘chunks’ of information may be exchanged or modified: once access to a given source is gained, all the ‘data’ stored at that specific location becomes available. We give complete axiomatizations for the resulting logics, and prove their decidability and co-expressivity.
[MA-5] Resolving Asynchronous Distributed Knowledge
【速读】:该论文旨在解决传统分布式知识逻辑(distributed knowledge)在处理异步信息共享动态时的局限性。现有逻辑如Agotnes和Wang提出的“解析分布式知识逻辑”(Resolving Distributed Knowledge Logic),虽能刻画多主体间知识共享与同步更新机制,但其假设所有参与者均处于全局同步环境(即存在共同的时间感知),且参与者为无记忆状态,无法反映现实分布式系统中异步、局部可观测的通信特性。为此,本文提出一种异步解析分布式知识逻辑(Resolving Asynchronous Distributed Knowledge Logic),其核心在于引入基于历史的语义框架——真理不仅依赖于当前世界状态,还依赖于个体可观察到的历史路径(history of prior resolutions),而每个代理仅能观测包含自身的分组的知识共享事件,对不包含自身的群体行为保持无知。这一设计使得逻辑更贴近真实分布式计算场景中的异步性与局部可观测性,尽管带来了公理化体系重构等技术挑战(如原有同步条件下分辨率与分布式知识之间的公理不再成立),但显著提升了模型在分布式系统建模中的表达力与适用性。
链接: https://arxiv.org/abs/2606.31855
作者: Philippe Balbiani(IRIT, CNRS-INP-University of Toulouse),Hans van Ditmarsch(IRIT, CNRS-INP-University of Toulouse),Clara Lerouvillois(IRIT, CNRS-INP-University of Toulouse, IHPST, CNRS-Paris 1 Pantheon Sorbonne)
机构: 未知
类目: Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:There are by now various epistemic modal logics with intersection modalities for distributed knowledge and intersection update modalities for dynamic phenomena like agents sharing (all their) information, agents receiving information from other agents, and full information protocols. One of those is the logic of Resolving Distributed Knowledge, by Agotnes and Wang. It has distributed knowledge modalities for arbitrary subsets of the set of all agents and it also has so-called resolution modalities for arbitrary subsets of agents sharing their knowledge. In that logic, the agents not involved in the knowledge sharing are aware of the agents sharing knowledge, agents are memory-less, and the kind of dynamics represents synchronous updates, where there is common awareness of the global clock. In contrast, in this contribution we present a logic for Resolving Asynchronous Distributed Knowledge. It is an asynchronous generalization of the synchronous logic of resolving distributed knowledge. The logical semantics is history-based: truth is not only with respect to a given world in a model, but also with respect to a given history of prior resolutions, of which each individual agent can only observe a part. In particular, an agent is unaware of resolutions for groups of agents not including her. As is to be expected, this comes with many technical complications, for example concerning the axiomatization. The synchronous axioms relating resolution to distributed knowledge are now invalid. The modelling advantages of such an asynchronous novel logic, for distributed computing and similar areas, are however substantial and a major asset.
[MA-6] ForecastAgentS earch: Towards a Multi-Expert Agent Search System for Geopolitical Event Forecasting
【速读】:该论文旨在解决地缘政治事件预测中的复杂性问题,即如何在动态变化的区域背景、多源异构事件信号及未来结果高度不确定性的情境下,实现准确且可解释的预测。其核心挑战在于整合多样化信息源与专家观点,并有效应对知识碎片化与判断偏差。解决方案的关键在于提出一种名为ForecastAgentSearch的初步框架,将地缘政治事件预测建模为多专家智能体(multi-expert agent)搜索问题:系统首先解析任务上下文,继而基于区域知识、领域专长、可靠性与互补性等维度对候选专家智能体进行检索与排序;被选中的智能体提供专业化分析,再通过多智能体协同机制整合输出最终预测结果,包含解释性说明与不确定性感知。该方法的核心创新点在于构建可搜索、可评估的智能体驱动预测体系,突破传统模型在动态认知协同与可解释性方面的局限。
链接: https://arxiv.org/abs/2606.31665
作者: Miaomiao Cai,He Chang,Yunshan Ma,See-kiong Ng
机构: National University of Singapore(新加坡国立大学); Communication University of China(中国传媒大学); Singapore Management University(新加坡管理大学)
类目: Multiagent Systems (cs.MA)
备注:
Abstract:Geopolitical event forecasting is a challenging task, as it requires understanding complex regional contexts, dynamic event signals, and uncertain future outcomes. Recent advances in large language model agents provide new opportunities for building forecasting systems that can reason with diverse sources and expert perspectives. In this paper, we present \textitForecastAgentSearch, a preliminary framework that formulates geopolitical event forecasting as a multi-expert agent search problem. Given a forecasting query, the system first analyzes the task context, then searches and ranks relevant expert agents based on their regional knowledge, domain expertise, reliability, and complementarity. The selected agents provide specialized analyses, which are further coordinated to generate a final forecast with explanations and uncertainty awareness. We discuss the key design challenges of agent profiling, expert retrieval, ranking, and multi-agent coordination, and outline possible evaluation protocols for future development. This work aims to provide an initial step toward searchable and reliable agent-based forecasting systems.
[MA-7] A Tutorial on Autonomous Fault-Tolerant Control Using Knowledge-Grounded LLM Agents
【速读】:该论文旨在解决过程工业中故障恢复仍高度依赖操作员、尤其在故障超出预设监督逻辑范围时缺乏有效支持的问题。其核心挑战在于如何在保证安全的前提下,为操作员提供智能化的恢复决策辅助。解决方案的关键是将大型语言模型(Large Language Model, LLM)作为受约束的监督规划器(constrained supervisory planner),利用面向具体装置的领域知识生成故障恢复动作建议,并通过外部验证器(基于符号逻辑或仿真)对每一项提议进行可行性检验,确保仅执行合法且安全的操作。该框架提出三个关键设计维度:适用于LLM代理的故障恢复模式、用于区分可接受与不可接受提议的验证策略,以及由延迟、知识工程复杂性、安全集成需求和模型生命周期管理所决定的部署约束。为实现可直接应用,论文提供了两个开源的可执行Python环境,复现了经典的模块化混合单元与连续搅拌釜反应器案例,并扩展了可配置故障场景及自定义恢复与验证方法的接口。
链接: https://arxiv.org/abs/2606.31635
作者: Javal Vyas,Milapji Singh Gill,Artan Markaj,Felix Gehlhoff,Mehmet Mercangöz
机构: Imperial College London (帝国学院伦敦大学); Helmut Schmidt University (赫尔穆特·施密特大学)
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Fault recovery in process plants still relies heavily on plant operators, especially when faults fall outside predefined supervisory logic. Operators interpret alarms, procedures, P\IDs, interlocks, and process trends, then decide how to move the plant to a safe operating mode without triggering a shutdown. This paper examines how Large Language Model (LLM) agents can support such recovery decisions. The proposed framework treats the LLM as a constrained supervisory planner. It uses plant-specific knowledge to propose recovery actions, and every proposal is checked by an external validator (symbolic or simulation-based) before actuation. The paper develops three design dimensions for applying the framework: the recovery patterns for which LLM agents are useful, the validation strategies that separate admissible from inadmissible proposals, and the deployment constraints imposed by latency, knowledge engineering, safety integration, and model lifecycle management. To make the framework directly usable, two openly available executable Python environments are provided. Both re-implement established case studies, a modular mixing module and a continuous stirred-tank reactor, extended with configurable faults and defined interfaces for custom recovery and validation methods.
[MA-8] A Large-Scale Empirical Evaluation of MMAO Under Fair-Budget Continuous and Discrete Benchmarks
【速读】:该论文旨在解决生成式优化算法在跨域任务中资源分配机制的可信度与适应性问题,特别是检验代谢多智能体优化器(Metabolic Multi-Agent Optimizer, MMAO)所依赖的闭环资源分配原则在更严格、更标准且显式预算控制下的连续与离散优化基准中的有效性。其核心解决方案的关键在于通过一套强化的实证评估协议,系统验证MMAO在多种复杂环境下的性能表现,包括8个CEC2017连续优化函数(10D与30D各20次独立运行)和5个TSPLIB旅行商问题实例(各20次运行),并引入多项轨迹级诊断指标(如群体预算动态、成功率、角色演化及种群更替率),辅以OR-Library多重背包问题子集以拓展离散优化证据链。实验结果表明,MMAO在连续与组合优化任务上均显著优于外部基线方法(如PSO-lite、ES-lite及迭代贪心2-opt路径基准),且其消融变体仍明显优于基线,证实了其内在资源再分配机制在证据压力下的有效性。因此,论文将MMAO定位为一种基于基准验证的跨域自适应框架,其最明确的贡献在于内生性资源重分配能力,而当前主要局限在于机制隔离的精细程度不足及更具竞争性的广泛对比尚待完善。
链接: https://arxiv.org/abs/2606.31584
作者: Jinliang Xu,Liping Ma
机构: The Seventh Medical Center of Chinese PLA General Hospital (中国人民解放军总医院第七医学中心)
类目: Neural and Evolutionary Computing (cs.NE); Multiagent Systems (cs.MA)
备注:
Abstract:This paper evaluates the Metabolic Multi-Agent Optimizer (MMAO) under a stricter empirical protocol rather than reintroducing the framework itself. The study asks whether MMAO’s closed-loop resource-allocation principle remains credible under broader, more standard, and more explicitly budget-controlled continuous and discrete benchmarks. The main completed matrix covers eight CEC2017 functions at 10D and 30D with 20 seeds each, and five TSPLIB instances with 20 seeds each, together with stronger reproducible baselines including PSO-lite, ES-lite, and an iterated-greedy 2-opt route baseline. We further add trajectory-level diagnostics for communal budget, success rate, role evolution, and population turnover, plus an auxiliary OR-Library multiple-knapsack slice to extend the discrete evidence beyond routing. Under this protocol, MMAO clearly outperforms the external baseline set on the continuous side and on the TSPLIB side, while the ablation variants remain much closer to the full method than the external baselines are. We therefore position MMAO as a benchmark-backed cross-domain adaptive framework whose most clearly validated value is endogenous resource redistribution under evidence pressure, while also noting that the strongest remaining gap is not basic workability but sharper mechanism isolation and broader competition-grade comparison.
[MA-9] Holonic Active Distillation for Scalable Multi-Agent Learning in Multi-Sensor Systems
【速读】:该论文旨在解决传感器网络在开放环境中因动态拓扑变化带来的可扩展性、自适应能力及知识迁移难题,尤其针对新子系统频繁加入或退出场景下的系统稳定性与学习效率问题。其核心解决方案是提出一种基于全息多智能体系统(Holonic Multi-Agent System, HMAS)的全息主动蒸馏(Holonic Active Distillation)架构,其中引入了聚类流式主动蒸馏(Clustered Stream-Based Active Distillation, CSBAD)框架:通过专用的学生模型在本地收集数据,向教师模型查询伪标签,并依据相似性对传感器进行聚类分组,实现局部专精与全局泛化的协同平衡。该方法显著提升了系统对传感器动态离散与重入的适应能力,同时揭示了增量模型更新、系统重构与可扩展性边界之间的权衡关系。研究结果表明,全息学习机制在多传感器系统中具有显著优势,但也暴露了模型漂移与长期适应性等关键挑战。
链接: https://arxiv.org/abs/2606.31578
作者: Dani Manjah,Tim Bary,Benoît Macq,Stéphane Galland
机构: 未知
类目: Multiagent Systems (cs.MA)
备注: 21 pages, 5 figures, 2 tables, accepted to EMAS 2025
Abstract:The rapid expansion of sensor-based networks introduces major challenges in scalability, adaptability, and knowledge transfer, especially in open environments where new subsystems can dynamically join or leave. In this work, we propose a Holonic Active Distillation architecture within a Holonic Multi-Agent System (HMAS) to address these issues. Our approach integrates Clustered Stream-Based Active Distillation (CSBAD), a framework in which specialized student models collect local data, query pseudo-labels from teacher models, and cluster into groups of similar sensors. Results show that the holonic organization balances local specialization with global generalization, while efficiently adapting to sensor departures and re-integrations. We also analyzed trade-offs among incremental model updates, system reorganization, and scalability limits. Our findings highlight the advantages of holonic learning for multi-sensor systems while identifying key challenges related to model drift and long-term adaptation. Comments: 21 pages, 5 figures, 2 tables, accepted to EMAS 2025 Subjects: Multiagent Systems (cs.MA) Cite as: arXiv:2606.31578 [cs.MA] (or arXiv:2606.31578v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2606.31578 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: 2025 13th International Workshop on Engineering Multi-Agent Systems Related DOI: https://doi.org/10.1007/978-3-032-18011-7_6 Focus to learn more DOI(s) linking to related resources
[MA-10] DataEvolver: Self-Evolving Multi-Agent Data Construction for Text-Rich Image Generation
【速读】:该论文旨在解决文本丰富的图像生成(text-rich image generation)中训练数据构建效率低下的问题,核心挑战在于如何在保证图像视觉真实性的同时,实现文本的可读性、语义一致性与版式协调性。现有数据构建流程普遍采用静态的“爬取-过滤-冻结”范式,即一次性收集并筛选样本后固定用于训练,导致被拒绝的样本虽包含诸如光学字符识别(OCR)错误和语义错配等关键失败信号,却往往被丢弃,从而在后续构建轮次中重复相同的错误模式。为克服这一局限,本文提出DataEvolver——一种自演化多智能体框架,其关键在于将数据构建过程建模为反馈驱动的策略演化过程:通过检索器(Retriever)获取候选样本,验证器(Verifier)评估质量并标注拒绝原因,批评者(Critic)提炼回合级反馈形成语义层面的总结,生成器(Generator)针对覆盖不足区域进行目标导向合成,最终将更新后的反馈记忆用于指导下一构建轮次。实验表明,在相同数据预算下,DataEvolver生成的训练数据更具价值;在PixArt-alpha 0.75M规模下,其在TextScenesHQ和LongTextBench上的OCR-F1分别较最强基线提升85.3%和35.3%,且结果在跨模型任务(Show-o2)上仍具泛化能力,证明了被拒绝样本蕴含的可行动反馈对提升文本丰富图像数据质量具有显著价值。
链接: https://arxiv.org/abs/2606.31537
作者: Siyu Yan,Yizhen Gao,Yilin Wang,Dongxing Mao,Alex Jinpeng Wang
机构: Central South University (中南大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:
Abstract:Text-rich image generation is one of the most challenging settings in image generation, since models must simultaneously produce visually realistic images and render legible, semantically aligned, and layout-consistent text. Existing data pipelines usually follow a static crawl-filter-freeze paradigm. They collect candidate samples, filter them once, and freeze the accepted data for training. However, rejected samples are usually discarded, although they often contain useful failure signals such as OCR errors and semantic mismatches. As a result, later construction rounds may repeat the same failure modes. To address these limitations, we propose DataEvolver, a self-evolving multi-agent framework for text-rich image data construction. DataEvolver treats data construction as feedback-driven construction policy evolution. A Retriever collects candidate samples, a Verifier assigns quality scores and rejection causes, a Critic summarizes round-level feedback into semantic feedback, and a Generator completes under-covered regions through targeted synthesis. The updated feedback memory then guides the next construction round. Experiments on text-rich image generation benchmarks show that DataEvolver produces more useful training data than fixed-dataset baselines under matched data budgets. At the 0.75M scale on PixArt-alpha, DataEvolver improves OCR-F1 over the strongest baseline by 85.3 percent on TextScenesHQ and 35.3 percent on LongTextBench. The improvements are consistent across both evaluated benchmarks and also transfer to Show-o2, indicating that the benefit of DataEvolver is not tied to a single downstream generator. These results suggest that rejected samples can provide actionable feedback for improving text-rich image data construction.
[MA-11] Governance Gaps in Agent Interoperability Protocols: What MCP A2A and ACP Cannot Express
【速读】:该论文旨在解决当前自主代理间互操作协议(如MCP、A2A、ACP、ANP和ERC-8004)在支持企业级受治理的代理社区方面存在的根本性缺失问题。尽管这些协议已成熟至可实现身份认证、能力发现、工具访问与消息交换,但其设计仍聚焦于任务导向的协同,难以满足复杂组织中需遵循治理约束的集体决策需求。解决方案的关键在于提出一个基于组织理论、多智能体系统文献及企业治理标准的六维治理需求分类法(包括成员资格、协商过程、投票机制、异议保留、人工升级路径以及审计与回放),并以此对现有协议进行系统性差距分析。分析结果表明,所有协议均缺乏完整的投票与异议保留能力,协商过程亦普遍缺失或仅部分支持,且无一协议具备构建受治理代理社区所需的全部原语。研究进一步区分了可通过协议扩展机制弥补的“可扩展性缺口”与需引入全新架构层才能解决的“结构性缺口”,并评估其时间敏感性。最终结论指出,代理社区治理并非现有协议功能的补充缺陷,而是一个亟待建立的、高于当前互操作性标准的独立架构层级。
链接: https://arxiv.org/abs/2606.31498
作者: Richard Kang,Yudho Diponegoro
机构: 未知
类目: Multiagent Systems (cs.MA); Software Engineering (cs.SE)
备注:
Abstract:Agent interoperability protocols (MCP, A2A, ACP, ANP, and ERC-8004) have rapidly matured to enable identity, capability discovery, tool access, and message exchange between autonomous agents. However, as enterprises deploy heterogeneous agent fleets that must make collective decisions under governance constraints, a question arises: can these protocols support governed agent communities, or only task-oriented coordination? We present a systematic gap analysis applying a six-dimension governance requirements taxonomy (membership, deliberation, voting, dissent preservation, human escalation, and audit/replay) derived from organizational theory, multi-agent systems literature, and enterprise governance standards. We analyze each protocol’s specification against this taxonomy, classifying capabilities as Supported, Partial, or Absent. The resulting gap matrix reveals that voting and dissent preservation are universally absent across all five protocols, deliberation is absent or at most partial, and no protocol encodes the full set of primitives required for governed agent communities. We distinguish extensible gaps (addressable through protocol extension mechanisms) from structural gaps (requiring a new architectural layer) and assess time-sensitivity based on observed protocol evolution velocity. The analysis establishes that agent community governance constitutes a missing architectural layer above current interoperability standards, not a missing feature within them.
[MA-12] MultiUAV-Plat: An LLM -Oriented Platform Benchmark and Framework for Multi-UAV Collaborative Task Planning
【速读】:该论文旨在解决大语言模型(LLM)在多无人机(multi-UAV)协同任务规划中缺乏系统性评估框架的问题。现有无人机仿真平台多聚焦于动力学、感知或底层控制,而现有的LLM智能体评测基准又未能充分涵盖空中机器人特有的约束条件,如部分可观测性、空间覆盖要求、无人机分配以及多机协同等关键挑战。为此,论文提出MultiUAV-Plat——一个面向LLM智能体的轻量化、易用型多无人机协同任务规划仿真平台,其核心创新在于通过简洁的RESTful API、面向智能体的观测信息、基于角色的信息访问机制、隐藏的验证逻辑及可选的2D/3D可视化,使智能体能够以真实工具交互方式完成任务,而非依赖特权仿真器访问。在此平台上构建的MultiUAV-Plat Benchmark包含75个任务会话、1500个自然语言任务和9396次验证检查,覆盖目标分配、区域搜索与区域巡检等典型场景。进一步提出的Agent4Drone是一种面向任务的LLM智能体框架,将多无人机行为结构化为记忆、观测、任务理解、规划、执行与验证六个模块。在全配对基准测试中,Agent4Drone实现57.9%的任务通过率、74.6%的平均任务检查通过率和72.0%的全局检查通过率,显著优于ReAct基线(分别为30.6%、47.9%和43.1%),并将总任务失败率从32.4%降低至12.9%。结果表明,MultiUAV-Plat及其基准测试体系为研究在现实信息与执行约束下的LLM驱动多无人机自主决策提供了可复现的基础平台。
链接: https://arxiv.org/abs/2606.31073
作者: Sheng Zhang,Qinglin Li,Yuechao Zang,Xueqin Huang,Yijia Fu,Cheng Zhu
机构: National University of Defense Technology (国防科技大学)
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:
Abstract:Large language models (LLMs) provide a promising interface for high-level robotic task planning, but their use in multi-UAV collaboration remains difficult to evaluate systematically. Existing UAV simulators mainly emphasize dynamics, perception, or low-level control, while existing LLM-agent benchmarks rarely capture aerial-robotics constraints such as partial observability, spatial coverage, UAV assignment, and multi-vehicle coordination. To bridge this gap, we present MultiUAV-Plat, a lightweight, easy-to-use, LLM-agent-oriented simulation platform for multi-UAV collaborative task planning. The platform exposes concise RESTful APIs, agent-facing observations, role-based information access, hidden validation logic, and optional 2D/3D visualization, allowing agents to solve missions through realistic tool interaction rather than privileged simulator access. Built on this platform, the MultiUAV-Plat Benchmark contains 75 mission sessions, 1500 natural-language tasks, and 9396 validation checks across target assignment, area search, and area assignment and patrol scenarios. We further propose Agent4Drone, a task-specific LLM agent framework that structures multi-UAV behavior into memory, observation, task understanding, planning, execution, and verification. In a full paired benchmark comparison, Agent4Drone achieves a 57.9% task pass rate, a 74.6% average task check pass rate, and a 72.0% global check pass rate, substantially outperforming a ReAct baseline at 30.6%, 47.9%, and 43.1%, respectively. Agent4Drone also reduces the total failed task rate from 32.4% to 12.9%. These results demonstrate that MultiUAV-Plat and MultiUAV-Plat Benchmark provide a reproducible foundation for studying LLM-driven multi-UAV autonomy under realistic information and execution constraints.
[MA-13] he Organizational Behavior of Agent ic AI: Collective Intelligence in Human-Agent Workflows
【速读】:该论文旨在解决的问题是:当生成式 AI(Generative AI)以多个智能体协同工作的形式部署时,这些智能体集体是否表现出可与人类组织行为相类比但又有所区别的组织行为特征。其核心解决方案的关键在于提出“情境交易成本”(contextual transaction cost)作为连接相似性与差异性的核心机制。研究指出,尽管智能体集体在分工、协调、流程化运作、跨边界协作及产生集体成果等方面呈现出类似人类组织的特征,但其行为基础并非源于动机、身份认同、信任、雇佣关系、社会化或道德问责等社会性要素,而是依赖于情境架构——包括提示(prompts)、记忆系统、操作痕迹、知识框架、工具调用、验证机制和权限控制等技术性结构。通过计算建模、合成任务仿真、真实大语言模型(LLM)智能体行为轨迹分析及鲁棒性检验发现,模仿人类模式的系统常因信息损耗性传递、相关性审议与验证负担而表现不佳;相比之下,采用共享状态与自适应机制的系统在提升情境持久性、可观测性和任务敏感性方面更具优势。因此,本文为组织研究提供了新的理论视角,将生成式 AI 视为一种新兴的“组织化对象”,并明确了人机组织行为协同支持集体智能的界面条件。
链接: https://arxiv.org/abs/2606.30986
作者: Canhui Liu
机构: University College London(伦敦大学学院); The AI Hub in Generative Models(生成模型人工智能中心)
类目: Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Multiagent Systems (cs.MA); General Economics (econ.GN)
备注:
Abstract:Agentic artificial intelligence is increasingly deployed not as a single assistant but as a collective of planners, solvers, reviewers, memory managers, tool users, and orchestrators. These systems are entering organisational workflows under familiar labels such as teams, managers, committees, markets, and workflows. This article asks whether such agent collectives exhibit organisational behaviour in a sense that is analytically comparable to, yet distinct from, human organisational behaviour. I argue that agentic AI is a partial organisational analogue. It resembles a human organisation because it differentiates work, coordinates interdependence, performs recurrent routines, crosses boundaries, and produces collective outcomes. It differs because these patterns are not sustained by motivation, identity, trust, employment, socialisation, or moral accountability. They are sustained by context architecture: prompts, memory, traces, schemas, tools, validators, and permissions. The article develops contextual transaction cost as the central mechanism linking these similarities and differences. Computational theorising, synthetic task simulations, real LLM agent traces, and robustness analyses show that human-imitation forms often underperform when they add lossy handoffs, correlated deliberation, and verification burdens, whereas shared-state and adaptive forms perform better when they make context durable, inspectable, and task-contingent. The article contributes to organisation studies by theorising agentic AI as an emerging object of organising and by specifying the interface conditions under which human and agentic organisational behaviour can jointly support collective intelligence.
[MA-14] HyPOLE: Hyperproperty-Guided Multi-Agent Reinforcement Learning under Partial Observation
【速读】:该论文旨在解决多智能体强化学习(MARL)在部分可观测环境下,如何有效利用形式化规范来指导学习过程的问题。传统方法如奖励塑形存在表达能力有限、缺乏数学严谨性等局限,而形式化规范虽具备数学严谨性、目标与约束的强表达能力以及战术定义能力,却在MARL领域尚未得到充分应用。为此,本文提出HyPOLE框架,通过引入超性质(hyperproperties)及时间逻辑HyperLTL,赋予智能体对全局行为模式的精确建模能力,从而在部分可观测场景下实现更优的学习策略。其解决方案的关键在于将集中式训练与分布式执行(CTDE)机制与基于HyperLTL的形式化规范相结合,利用超性质对多智能体协同行为进行高层语义约束,进而合成出满足复杂逻辑目标的去中心化策略,实验在SMAC、MessySMAC和WildFire基准上验证了该方法相较于基线显著的性能优势。
链接: https://arxiv.org/abs/2606.30966
作者: Arshia Rafieioskouei,Tzu-Han Hsu,Matthew Lucas,Borzoo Bonakdarpour
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注:
Abstract:Formal specification is a powerful tool to guide the learning process and provides significant advantages over reward shaping: (1) mathematical rigor; (2) expressiveness to specify objectives and constraints, and (3) the ability to define tactics to achieve objectives. However, these benefits remain largely unexplored in the context of Multi-Agent Reinforcement Learning (MARL). This paper introduces HyPOLE, a novel framework for MARL under partial observability, where learning is guided by the expressive power of the so-called hyperproperties and, in particular, the temporal logic HyperLTL. We integrate Centralized Training for Decentralized Execution (CTDE) techniques with HyPOLE to synthesize decentralized policies, and our evaluation on SMAC, MessySMAC, and WildFire benchmark demonstrates clear advantages over baselines.
[MA-15] RoPoLL: Robust Panel of LLM Judges
【速读】:该论文旨在解决生成式 AI(Generative AI)评估中广泛使用的“大模型评审团”(PoLL, Panel of LLM Evaluators)在面对偏差污染时的统计不鲁棒性问题。尽管PoLL通过多评委共识提升评估可靠性,但其在存在任何正比例污染(如模式坍缩、谄媚行为、安全拒绝等典型的模型偏差)的情况下,均会产生无界偏差,且该偏差不受评审团规模影响。其核心问题是:传统平均聚合策略对极端异常值敏感,无法抵御恶意或系统性偏差。本文的关键解决方案是提出RoPoLL(Robust Panel of LLM-as-Judge),将评审团共识建模为经典鲁棒均值估计问题,采用几何中位数(Geometric Median, GM)作为替代聚合函数,实现无需调参、具有最优有限样本断裂点1/2的鲁棒性。理论分析表明,RoPoLL在有限样本下达到与信息论最小最大下界匹配的参数率σ*√(d/N),仅在断裂地板上因计算效率限制与不可行的Tukey半空间中位数存在√d倍差距,形成统计-计算间隙。实验在13个开源大模型、3个奖励模型基准及4种污染场景(最高50%)下验证,RoPoLL在各类有偏污染(包括跨维度攻击和重尾拜占庭攻击)中全面优于PoLL,尤其在38B参数的三评委组合下,以18倍参数优势超越675B的Mistral-Large-3,在30%双峰随机污染下仍保持1.31倍性能提升,且控制实验确认该优势源于对抗性偏差而非良性噪声。
链接: https://arxiv.org/abs/2606.30931
作者: Anish Acharya,Kris W Pan,Brian Verkhovsky
机构: Amazon Web Services (亚马逊云科技)
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Probability (math.PR)
备注:
Abstract:The LLM Jury, a Panel of LLM Evaluators (PoLL) reporting consensus scores, has become a practical alternative to single-judge LLM evaluation, yet its statistical behavior remains poorly understood. We formalize the LLM Jury under the Huber contamination model and show that PoLL incurs unbounded bias under any positive contamination, regardless of jury size, whenever a single judge fails in a biased, LLM-typical way (mode collapse, sycophancy, safety refusal). Framing jury consensus as classical robust mean estimation, we propose RoPoLL (Robust Panel of LLM-as-Judge), which preserves the PoLL panel but replaces the aggregation function with a robust mean estimator, instantiated with the geometric median (GM): tuning-free, with the optimal finite-sample breakdown point 1/2. A finite-sample error bound and a matching information-theoretic minimax lower bound agree on the parametric rate sigma*sqrt(d/N) and differ on the breakdown floor by a factor of sqrt(d), a statistical-computational gap that polynomial-time RoPoLL pays relative to the intractable Tukey halfspace median. Across 13 open-weight judges (4B-675B), three reward-model benchmarks, and four corruption regimes at rates up to 50%, RoPoLL dominates PoLL on every biased corruption type: by about 19% on cross-dimensional attacks at matched compute, and by orders of magnitude on heavy-tailed Byzantine adversaries. A 3-judge RoPoLL committee at 38B beats Mistral-Large-3 (675B) by 1.31x on HelpSteer-2 under 30% bimodal-random corruption, an 18x parameter advantage at better accuracy; a Noisy-GT control confirms the premium is paid against biased contamination, not benign imprecision. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Optimization and Control (math.OC); Probability (math.PR) Cite as: arXiv:2606.30931 [cs.AI] (or arXiv:2606.30931v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.30931 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Anish Acharya [view email] [v1] Mon, 29 Jun 2026 21:34:27 UTC (474 KB) Full-text links: Access Paper: View a PDF of the paper titled RoPoLL: Robust Panel of LLM Judges, by Anish Acharya and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.AI prev | next new | recent | 2026-06 Change to browse by: cs cs.LG cs.MA math math.OC math.PR References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); We gratefully acknowledge support from our major funders, member institutions, , and all contributors. About Help Contact Subscribe Copyright Privacy Accessibility Operational Status (opens in new tab) Major funding support from
[MA-16] Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering ICML2026
【速读】:该论文旨在解决机器学习工程代理(ML engineering agents)在每次竞赛中均面临“冷启动”问题,导致其重复浪费计算资源去重新发现已有技术。其核心解决方案是提出一种分层多智能体系统HASTE,通过将跨竞赛知识组织为三个层级(全局、领域和竞赛特定),并为每个层级配置相应的匹配智能体。系统由一个协调器管理领域专家,并利用大语言模型(LLM)驱动的抽象机制实现层级间的知识迁移。消融实验表明,在固定159项技能的情况下,分层加载策略可实现100%奖牌率,而平铺加载仅达62.5%,与不加载任何技能的效果相当,且输出令牌消耗翻倍。在完整的MLE-Bench Lite基准测试(22个Kaggle竞赛)上,HASTE使用Claude Sonnet 4.6模型,每竞赛耗时12小时,达到77.3%的奖牌率。在暖启动场景下,系统复用先前竞赛中积累的全局与领域级技能,相比冷启动显著减少52%的优化迭代次数,且智能体采纳提议变更的比例从低库存下的42%提升至50项以上技能时的85%。结果表明,更优的知识组织架构可在一定程度上替代更强的模型能力或更大的计算预算,从而提升智能体效率。
链接: https://arxiv.org/abs/2606.30911
作者: Yongbin Kim,Yashar Talebirad,Osmar R. Zaiane
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 19 pages. Accepted at the ICML 2026 Workshop on Deep Learning for Code (DL4C)
Abstract:ML engineering agents waste compute rediscovering known techniques because every competition is a cold start. We present HASTE, a hierarchical multi-agent system that organizes cross-competition knowledge into three scope tiers (global, domain, and competition-specific), each coupled to a matching agent level. An orchestrator coordinates domain specialists and promotes learning between tiers via LLM-driven abstraction. A controlled ablation provides evidence for scoped loading: holding a 159-skill inventory constant across 8 competitions, tiered loading achieves a 100% medal rate while flat loading reaches only 62.5%, the same medal rate as loading no skills, and consumes 2x the output tokens. On the full MLE-Bench Lite benchmark (22 Kaggle competitions), HASTE reaches a medal rate of 77.3% using Claude Sonnet 4.6 at 12h per competition. In a cold-start run, the system begins with no accumulated skills. In warm-start runs, it reloads skills learned from earlier competitions, using only global and domain-level skills for transfer across competitions. Warm starts use 52% fewer refinement iterations, and the fraction of proposed changes kept by the agent rises from 42% at low inventory to 85% once 50+ skills are available. These results suggest that better knowledge organization can partly substitute for model strength and compute budget in ML-engineering agents.
[MA-17] Sampling-Based Coordination-Informed Multi-Objective Multi-Robot Reinforcement Learning
【速读】:该论文旨在解决多机器人系统在协同任务中同时优化多个相互竞争目标时,如何实现高效、分布式且具备良好适应性的协调决策问题。现有基于多智能体强化学习的方法通常依赖于固定或集中式的协调机制,不仅限制了系统的动态适应能力,还违背了实际应用中对分布式控制的约束要求。本文提出的协调感知多目标强化学习(Coordination-Informed Multi-Objective Reinforcement Learning, CIMORL)框架,其核心解决方案在于引入分布式权重预测机制、特权专家训练策略,并提供帕累托最优解的理论保证。通过结合基于采样的两种变体——CIMORL-TS(树搜索)与CIMORL-MPPI(模型预测路径积分),该框架在训练阶段利用全局特权信息,从而实现完全去中心化部署。实验结果表明,在协作与对抗场景下,相比现有最优基准方法,该框架实现了21.2%的超体积(hypervolume)提升及更优的策略稳定性;真实世界中使用Crazyflie无人机进行的测试进一步验证了其在部分可观测条件下资源分配及多攻多防任务中的鲁棒性。
链接: https://arxiv.org/abs/2606.30893
作者: Antonio Marino,Esteban Restrepo,Soon-jo Chung,Paolo Robuffo Giordano,Claudio Pacchierotti
机构: University of Cambridge (剑桥大学); CNRS, Univ Rennes, Inria, IRISA (法国国家科学研究中心、雷恩大学、法国国家信息与自动化研究所、伊里斯研究所); California Institute of Technology (加州理工学院)
类目: Robotics (cs.RO); Multiagent Systems (cs.MA)
备注: 20 pages, 11 figures, 4 tables
Abstract:Multi-robot systems must simultaneously optimize competing objectives while maintaining coordinated behavior. Existing multi-agent reinforcement learning approaches often rely on fixed or centralized coordination, which limits adaptability and violates distributed constraints. This work introduces the Coordination-Informed Multi-Objective Reinforcement Learning (CIMORL) framework, integrating a distributed weight prediction mechanism, a privileged expert training strategy, and theoretical guarantees for Pareto-optimal solutions. We present the base CIMORL method alongside two sampling-based variants, CIMORL-TS (Tree Search) and CIMORL-MPPI (MPPI), which leverage privileged global information during training to enable fully decentralized deployment. Experimental validation in cooperative and adversarial scenarios demonstrates a 21.2% hypervolume improvement and superior policy stability compared to state-of-the-art baselines. Real-world experiments with Crazyflie drones further validate the framework’s robustness in resource allocation and multi-attacker multi-defend scenarios under partial observability.
[MA-18] raining Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在心理健康支持应用中,尽管具备潜在价值,但其治疗质量提升受限于评估机制仅为被动指标而非可行动的控制信号这一核心问题。现有方法往往依赖生成能力的增强,却忽视了基于人类价值观对输出进行动态反馈与修正的重要性。解决方案的关键在于提出一个双阶段框架:首先,在第一阶段引入TheraJudge,这是一个基于人类标注数据通过偏好优化训练的开源心理治疗评估器,能够对7个心理学维度(如安全性、相关性、共情等)提供可靠且一致的评价;其次,在第二阶段引入TheraAgent,通过协调“批评者”(Critic)、“教练”(Coach)和“治疗师”(Therapist)三个角色的决策精炼流程,将TheraJudge提供的多维评估信号转化为针对性响应修正,实现闭环优化。实证结果表明,TheraJudge在与临床医生评分的一致性上表现出色(组内相关系数ICC = 0.87–0.95),显著优于监督基线及部分封闭源代码评估工具;而基于其评估反馈的TheraAgent在盲评中使人类评定的治疗质量平均提升0.43分(5分制),低质量输出(≤3分)的修复率高达94%,平均提升2.45分,充分验证了以人类对齐评估驱动的主动干预策略在提升心理健康大模型疗效中的关键作用。研究强调,有效对齐心理健康大模型的核心不在于模型生成能力本身,而在于能否将人类导向的评估转化为可执行的修正机制。
链接: https://arxiv.org/abs/2606.30887
作者: Mizanur Rahman,Abeer Badawi,Elahe Rahimi,Laleh Seyyed-Kalantari,Frank Rudzicz,Enamul Hoque,Elham Dolatabadi
机构: York University ( York 大学); Vector Institute (向量研究所); Connected Minds (连接心智); Dalhousie University (达尔豪西大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Large language models show promise for mental health support, yet therapeutic quality improves only when evaluation functions as an actionable control signal rather than a passive metric. We introduce a framework that formulates therapeutic response generation as a decision-refinement problem driven by multi-dimensional, human-aligned evaluation. In Stage I, we introduce TheraJudge, an open-source therapeutic evaluator trained via preference-based optimization on human-annotated data to produce reliable judgments across 7 psychological dimensions. In Stage II, we introduce TheraAgent, which operationalizes TheraJudge’s evaluations through a coordinated refinement process with specialized Critic, Coach, and Therapist roles that translate evaluative signals into targeted response revisions. Empirically, TheraJudge achieves strong agreement with clinician ratings, with intraclass correlation coefficients (ICC = 0.87-0.95), surpassing supervised baselines and strong closed-source judges, particularly on critical dimensions such as Safety, Relevance, and Empathy. Acting on these evaluations, TheraAgent yields a +0.43 improvement in human-rated therapeutic quality (on a 5-point scale) under blind evaluation, with 96% clinician inter-rater reliability. Low-quality responses ( \leq 3 ) improve by +2.45 points with a 94% recovery rate, demonstrating targeted correction of unsafe outputs. Overall, our results indicate that effective alignment of mental-health LLMs stems from acting on human-aligned evaluation, rather than relying solely on stronger generation. We release code at this https URL.
[MA-19] Emergent Culture in Minimal LLM Systems
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)代理在缺乏上下文、提示信息极少且仅具备简单工具的情况下,如何实现自发协作与复杂行为演化的问题。其核心挑战在于探索在无显式指令和外部控制的前提下,多智能体系统能否通过自组织形成具有长期一致性和文化特征的集体行为。解决方案的关键在于构建一个基于群体工程(swarm engineering)的多代理框架:三个代理被赋予消息通信能力,并可操作一个随时间主动衰减的共享文本存储空间,从而引入进化压力。在此机制下,代理自发产生协作模式、发展出存储管理策略,并生成不断演化的复杂文化产物,表现出超越存储熵极限的长程结构化一致性,符合Sperber提出的“涌现文化”(emergent culture)概念。这一成果揭示了在去中心化、低干预条件下,由简单规则驱动的智能体集体可自然演化出高度复杂的社会性行为。
链接: https://arxiv.org/abs/2606.30668
作者: Simon Jones,Sabine Hauert
机构: University of Bristol, UK (布里斯托大学,英国)
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA); Adaptation and Self-Organizing Systems (nlin.AO); Populations and Evolution (q-bio.PE)
备注: 9 pages, 6 figures. Accepted for publication at Alife 2026 conference
Abstract:What happens when LLM agents operate with no context outside a turn, minimal prompting, and simple tools? Inspired by swarm engineering, we give collectives of three agents the ability to send messages and manipulate a shared actively decaying text store, introducing evolutionary pressure. The agents spontaneously cooperate, develop storage management strategies, and generate complex evolving cultural artifacts, with no top-down engineering. Using tools from dynamical systems analysis, we show that these behaviours exhibit structured long-range coherence beyond the entropy horizon of the decaying store, consistent with emergent culture in the Sperberian sense.
自然语言处理
[NLP-0] Introspective Coupling: Self-Explanation Training Tracks Behavioral Change Despite Fixed Supervision
【速读】: 该论文旨在解决生成式语言模型(Generative AI)在训练过程中生成预测解释时,如何确保其解释具有真实可信赖的内省性(introspection),而非仅对训练信号进行表面模仿的问题。其核心挑战在于:当模型被要求解释自身决策依据时,如何避免解释内容与实际行为脱节,从而实现对当前行为的真实反映。论文提出的关键解决方案是采用基于反事实(counterfactual)行为的监督信号——即利用模型自身早期检查点或行为相似但不同架构的其他模型在修改输入后的反事实输出作为解释训练的标签。研究发现,只要解释训练过程中保持与当前模型行为的足够相关性,即便训练标签来自固定的历史数据,模型仍能产生与其当前行为高度一致的解释,形成“内省耦合”(introspective coupling)。这一机制在多个任务中表现稳健,包括讨好倾向(sycophancy)和拒绝行为(refusal),且对标签噪声具有鲁棒性。因此,该研究揭示了即使使用静态的反事实解释数据集,也能为模型提供可扩展、通用的后训练内省信号,显著提升了模型解释的可信度与动态适应能力。
链接: https://arxiv.org/abs/2606.32038
作者: Zifan Carl Guo,Laura Ruis,Jacob Andreas,Belinda Z. Li
机构: MIT EECS(麻省理工学院电子工程与计算机科学系)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 32 pages, 19 figures
Abstract:When does training language models (LMs) to generate explanations of their predictions yield faithful introspection, rather than superficial imitation? We study LMs trained to explain which features of their inputs influenced their behavior, using models’ counterfactual behavior on modified inputs as supervision. Surprisingly, we find that LMs trained on fixed counterfactual explanations derived from earlier checkpoints of themselves, or even from behaviorally similar models in different families, frequently produce explanations more faithful to their own current behaviors than to those of their training targets. This “introspective” coupling between LM explanations and behaviors occurs when training explanations remain sufficiently correlated with current behaviors over the course of training, even as behaviors themselves shift. We also show that introspective coupling tracks behavior shifts: when explanation training is provided concurrently with other post-training objectives, explanations track those shifts without requiring updated supervision. This phenomenon appears in multiple tasks, including sycophancy and refusal, and is robust to label noise. Overall, our results show that even fixed datasets of counterfactual explanations can provide scalable and generalizable post-training signal for introspection.
[NLP-1] QVal: Cheaply Evaluating Dense Supervision Signals for Long-Horizon LLM Agents
【速读】: 该论文旨在解决生成式智能体(LLM agents)在长时序决策任务中因采用仅基于最终结果的奖励机制而导致中间动作评估信号过稀疏的问题。传统密集监督方法虽通过评分中间步骤以提供更精细的反馈,但其性能评估常依赖于集成训练管道的下游表现,存在成本高、训练工程因素与监督质量混淆、不同方法难以公平比较等缺陷。为此,本文提出QVal——一种无需训练的测试平台,其核心在于直接评估密集监督信号的质量:给定状态-动作对,衡量某方法所生成的评分是否与强参考策略(reference-policy)的Q值排序一致(Q-aligned),从而实现对监督信号本身优劣的独立量化。该方法的关键创新在于将监督信号质量从训练流程中解耦,使不同方法可在统一基准上进行可比性评估。研究者基于此构建了QVal-v1.0,对21种密集监督方法在四个多样化环境、七类方法体系下进行了超过1200次实验,涵盖六种开源模型架构。结果表明,简单的提示基线(prompting baselines)普遍优于文献中较新的复杂方法,且性能表现高度聚集于方法家族层面,这一发现跨模型规模、环境和观测模态均成立。QVal具备良好的可扩展性,支持研究人员在训练前快速迭代与验证新监督方法。
链接: https://arxiv.org/abs/2606.32034
作者: Sergio Hernández-Gutiérrez,Matteo Merler,Ilze Amanda Auzina,Joschka Strüber,Ameya Prabhu,Matthias Bethge
机构: Tübingen AI Center, University of Tübingen (图宾根人工智能中心,图宾根大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 10 pages, 5 figures in main text; 48 pages, 6 figures with appendix
Abstract:LLM agents increasingly act over long horizons, where a single trajectory can contain hundreds or thousands of actions. In these settings, outcome-only rewards provide too sparse guidance, failing to inform the model about the goodness of intermediate actions. Dense supervision methods aim to solve this problem by scoring intermediate steps, from intrinsic confidence to self-distillation and embedding similarities. However, it is common practice to evaluate them by measuring the downstream performance of a training pipeline that integrates them. This is expensive, conflates supervision quality with training engineering confounders, and renders different methodological families requiring distinct training setups incomparable. As a result, dense supervision methods are rarely benchmarked on common ground. We introduce QVal, a training-free testbed for directly evaluating dense supervision signals. Given a state-action pair, QVal measures how well a method’s score is Q-aligned: whether it orders actions according to the Q-values of a strong reference-policy. This lets us compare signals before any training run and separate signal quality from other engineering choices. We instantiate QVal as QVal-v1.0, benchmarking 21 dense supervision methods across four diverse environments and seven methodological families, with over 1.2K evaluation experiments across six open-weight model backbones. We find that simple prompting baselines consistently outperform recent dense supervision methods from the literature, and that performance clusters strongly by family. These findings hold across model sizes, environments, and observation modalities. QVal is designed to be easily extensible to new environments and methods, enabling researchers to iterate on dense supervision methods before any training run.
[NLP-2] Reinforcement Learning with Metacognitive Feedback Elicits Faithful Uncertainty Expression in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在元认知(metacognition)能力上的系统性缺陷,具体表现为高置信度幻觉、无法识别知识边界以及未能准确表征内部不确定性,从而严重影响模型的可信度与可靠性。其核心解决方案在于引入两种创新机制:基于元认知反馈的强化学习(Reinforcement Learning with Metacognitive Feedback, RLMF),通过优化偏好过程中对模型自我评估性能质量的反馈来精炼生成结果的排序;以及元认知数据选择(metacognitive data selection),利用模型自身的性能自评来筛选高价值训练样本,优于传统的主动学习方法。研究将这些方法应用于忠实校准(Faithful Calibration, FC)任务——一个本质上的元认知任务,目标是使模型表达出的置信度与其内在不确定性相一致。采用两阶段解耦策略,先通过上述方法校准模型自报置信度的忠实性,再通过针对性输出编辑将其映射为可适应上下文的语言化不确定性表达。实验表明,RLMF在多种任务上实现了通用且领先的忠实校准性能,同时保持了原有准确性,并相较于标准强化学习提升高达63%,显著增强了模型对其自身能力边界的评估与表达能力。该研究证明,元认知表现可作为有效的强化学习信号,克服以往内在反馈方法的局限,为提升大模型的元认知能力与对齐水平提供了新范式。
链接: https://arxiv.org/abs/2606.32032
作者: Gabrielle Kaili-May Liu,Avi Caciularu,Gal Yona,Idan Szpektor,Arman Cohan
机构: Yale University; Google Research
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Code: this https URL
Abstract:Metacognition is a critical component of intelligence that describes the ability to monitor and regulate one’s own cognitive processes. Yet LLMs exhibit systemic deficiencies in key metacognitive faculties: they hallucinate with high confidence, fail to recognize knowledge boundaries, and misrepresent their internal uncertainty–undermining trustworthiness and reliability. Since monitoring task performance and adapting behavior accordingly are central to metacognition, we posit that models capable of accurately judging their own performance are better positioned to improve it. We operationalize this idea via two novel mechanisms: reinforcement learning with metacognitive feedback (RLMF), a paradigm to refine completion rankings during preference optimization based on the quality of a model’s self-judgments of performance, and metacognitive data selection, which uses similar self-judgments to identify high-value training examples, outperforming naive active learning. We apply these innovations to the problem of faithful calibration (FC), a task that is itself fundamentally metacognitive: the goal is to align expressed with intrinsic uncertainty, difficult even for frontier LLMs. We adopt a two-stage, decoupled approach, first using these methods to calibrate the faithfulness of models’ self-reported confidence scores, then mapping to natural, context-adaptable linguistic uncertainty via targeted output editing. Extensive experiments show RLMF achieves generalizable, state-of-the-art FC on diverse tasks while preserving accuracy. Further, RLMF surpasses standard RL by up to 63% while enhancing models’ ability to assess and express their own capability limits. This positions RLMF as a promising paradigm to enhance LLM metacognition toward improved abilities and alignment, and suggests metacognitive performance as an effective RL signal to overcome limits of prior intrinsic feedback methods.
[NLP-3] When LLM s Read Tables Carelessly: Measuring and Reducing Data Referencing Errors ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理表格任务时存在的数据引用错误(Data Referencing Errors, DREs)问题,即模型在推理过程中错误引用或遗漏表格中的实际数值,尽管其能够理解表格结构。此类错误直接影响中间推理步骤的正确性与可靠性,而现有研究对此类问题的分析范围有限且规模较小。本文首次系统性地评估了不同模型(参数量从1.7B至20B)在多种任务中发生DRE的情况,发现该现象普遍存在于所有测试模型中。解决方案的关键在于引入一种基于数据引用的批判器(critic)机制,通过批判性筛选与拒绝采样显著提升最终答案的准确率,最高可达12.0%。此外,研究还训练了一个轻量级4B参数的批判器模型,在检测分布内与分布外的DRE方面均达到平均78.2%的F1分数,有效辅助更大模型的推理过程,从而实现更高可信度的生成结果。
链接: https://arxiv.org/abs/2606.32029
作者: Yuqing Yang,Qi Zhu,Zhen Han,Boran Han,Zhengyuan Shen,Shuai Wang,Vassilis N. Ioannidis,Huzefa Rangwala
机构: University of Southern California; AWS AI Labs
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: ACL 2026 (Oral)
Abstract:While large language models (LLMs) perform well on table tasks, they still make data referencing errors (DREs), i.e., incorrectly citing or omitting table values, despite understanding the table structure. Beyond final-answer accuracy, DREs directly compromise the correctness and reliability of intermediate reasoning steps. Yet prior studies have only offered limited, small-scale analyses. In this work, we present the first systematic evaluation of tabular data referencing errors across different models and tasks. Our results show that DREs occur across all tested models (1.7B to 20B parameters). Furthermore, we demonstrate that incorporating data referencing as a critic significantly improves answer accuracy up to 12.0%, through critic-based filtering and rejection sampling. Finally, we trained a lightweight 4B-parameter critic model that achieves an average F1 score of 78.2% in detecting both in-distribution and out-of-distribution DREs, and effectively assists inference for larger models.
[NLP-4] Generative Skill Composition for LLM Agents
【速读】: 该论文旨在解决大语言模型(LLM)智能体在执行复杂任务时面临的技能组合选择瓶颈问题。随着技能库的不断扩展及其在多任务、多领域中的可复用性提升,如何高效地从海量技能中选出合适的技能子集、确定其数量并规划执行顺序,成为制约智能体性能的关键挑战。现有方法要么将全部技能暴露给推理过程,要么依赖嵌入或基于大模型的重排序器进行检索,均未能充分考虑技能组合的结构性本质——即技能选择、数量与执行顺序三者之间存在紧密耦合关系,不可分割。为此,论文提出“结构化技能组合”(structured skill composition)这一新范式,形式化为:在给定任务和技能库的前提下,预测一个可执行的技能计划,联合指定激活的技能子集、数量及执行顺序。为此,作者提出SkillComposer框架,将其建模为任务条件下的技能序列生成问题,采用受约束的自回归解码器对技能标识符进行建模,使技能子集、数量与顺序在单次解码过程中自然协同生成,并能有效捕捉技能间的依赖关系。通过利用真实人工标注的技能库构建训练数据集,实验表明,SkillComposer在保留低提示词成本的同时,在SkillsBench基准上显著提升了下游任务的成功率,相较于无技能基线分别在GPT-5.2-Codex和Gemini-3-Pro-Preview上提升23.1和18.2个百分点,超越了前3名检索方法,且达到黄金技能检索上限水平,验证了其在结构化决策上的优越性。
链接: https://arxiv.org/abs/2606.32025
作者: Xinyu Zhao,Zhen Tan,Vaishnav Tadiparthi,Nakul Agarwal,Kwonjoon Lee,Ehsan Moradi Pari,Hossein Nourkhiz Mahjoub,Tianlong Chen
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent LLM agents benefit from skills for solving complex tasks. Skills encapsulate modular packages of procedural knowledge and instructions for performing specialized tasks, such as setting up a sandboxed environment, running a test suite, or refactoring a function across multiple files. As skill libraries grow and become reusable across tasks and domains, selecting an appropriate skill composition has emerged as a central bottleneck. Existing approaches fall into two categories. One exposes the agent’s reasoning to the entire skill collection; the other performs skill retrieval via embeddings or LLM-based rerankers. Both provide useful insights; however, they miss the structural nature of skill composition, which is a joint decision over which skills, how many, and in what order – three dimensions that cannot be decoupled. We formalize this as structured skill composition: given a task and a skill library, predict an executable skill plan that jointly specifies the activated subset, count, and execution order. We propose SkillComposer, which instantiates structured skill composition as task-conditioned skill sequence prediction. SkillComposer uses a constrained autoregressive decoder over skill identifiers, so subset, count, and order emerge jointly from a single decoding pass, and dependencies between successive skills are captured naturally. We build a training set of task-composition pairs from a real, human-curated skill library. We then evaluate SkillComposer along two axes: composition quality on a held-out test set, and downstream task success on SkillsBench across two production-grade coding agents. On GPT-5.2-Codex, Gemini-3-Pro-Preview, SkillComposer raises the pass rate by +23.1, +18.2pp over the no-skill baseline, surpassing top-3 retrieval and matching the gold-skill retrieval upper bound at lower prompt-token cost.
[NLP-5] SemRF: A Semantic Reference Frame for Residual-Stream Dynamics in Language Models
【速读】: 该论文旨在解决语言模型在深度方向上计算过程演化分析中的核心难题:即中间解码时各层读出坐标的一致性问题。传统方法依赖于嵌入锚点与反向嵌入读出之间的对齐,若二者对语义跨度的选取不一致,所观测到的“语义运动”可能源于测量漂移而非真实计算动态。为此,本文提出语义参考系(Semantic Reference Frames, SemRF),一种基于锚点的分离式形式化框架,将语义度量与残差动力学解耦。其关键在于通过伪逆绑定实现精确同步,在受限双可逆条件下,确保语义基坐标稳定、畸变有界且变化近似恒等。固定参考系后,残差计算转化为沿深度的语义轨迹。锚点诱导出语义Voronoi图,以距离或逻辑值等证据将各层分配至粗粒度单元,同时保留单元内部的状态运动与边界裕度。进一步定义了逐层步长、贡献谱和不平衡诊断,并基于Voronoi轨迹构建裕度松弛管状结构;其中规范轨迹为该管内最小作用路径,当管状约束非空且二次权重为正时,其唯一且满足离散样条方程。过量作用量控制步长、曲率及谱失配程度;低曲率表明轨迹具有分段线性可压缩性与局部知识密度高,即轨迹复杂度低意味着更少的语义节点。通过参数到轨迹映射,该框架建立了对参数效率的条件关联:在符合数据拟合的可行设置中,低作用量与低复杂度轨迹对应更少的语义自由度,从而提升模型效率。上述理论保证依赖于受控接口误差与显式管状约束下的小投影残差。
链接: https://arxiv.org/abs/2606.32022
作者: Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: an early-stage version
Abstract:Residual-stream analysis asks how language-model computation evolves across depth, but intermediate decoding requires comparable readout coordinates across layers. If embedding anchors and unembedding readout disagree on the chosen span, apparent motion may reflect measurement drift rather than computation. We introduce \emphSemantic Reference Frames (SemRF), an anchor-based formalism separating semantic measurement from residual dynamics. A SemRF fixes anchors and measures states against them. Pseudo-inverse tying gives exact synchronization; under restricted bi-invertibility, SemRF yields stable semantic-basis coordinates, distortion bounds, and near-identity changes. With the frame fixed, residual computation becomes a depthwise semantic trajectory. The anchors induce a semantic Voronoi diagram: distance, or evidence such as logits, assigns each layer to a coarse cell, while coordinates retain within-cell motion and margins. We define layerwise steps, contribution profiles, and imbalance diagnostics, then use the Voronoi trace to define a margin-relaxed tube. The canonical trace is the minimum-action path inside this tube; when nonempty with positive quadratic weight, it is unique and obeys a discrete spline equation away from active constraints. Excess action controls step, curvature, and profile mismatch. Low curvature implies piecewise-linear compressibility and local knowledge density: lower trace complexity means fewer semantic knots. Through the parameter-to-trajectory map, this gives a conditional link to parameter efficiency: among admissible settings fitting data, lower-action and lower-complexity traces use fewer semantic degrees of freedom. The guarantees require controlled interface error and small projection residual under explicit tube constraints.
[NLP-6] Scalable Behaviour Cloning on Browser Using via Skill Distillation
【速读】: 该论文旨在解决浏览器智能体(browser agent)在执行复杂任务时面临的决策瓶颈问题,即在信息不完整的情况下如何做出有效决策。现有方法受限于低层级操作能力,而真正制约因素在于智能体缺乏基于人类交互经验的高层语义先验知识。其核心解决方案是通过技能蒸馏(skill distillation)实现可扩展的行为克隆:将用户在浏览器中的交互轨迹转化为紧凑的自然语言形式的技能表示,使智能体能够直接阅读、检索、复用和组合这些技能。进一步地,研究构建了一个技能图谱(skill graph),以结构化方式组织技能,促进技能的整合与演化,而非无限制累积。这一方法表明,浏览器智能体的可扩展性不应依赖于人工设计的任务,而应源于互联网用户已产生的集体行为经验。
链接: https://arxiv.org/abs/2606.32014
作者: Kaisen Yang,Zheng Jiang,Yuzhao Peng,Houde Qian,Boshi Zhang,Youjie Zheng,Shijin Hong,Qingle Liu,Ruoyu Han,Bohan Lyu,Bingxiang He,Eren Cai,Calvin Xiao,Qinhuai Na
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Internet users collectively perform an enormous range of skilled work through web browsers, from software development and document editing to search, forms, and enterprise workflows, making human browsing a highly scalable but under-exploited source of reusable browser skills. We argue that the bottleneck for browser agents is decision-making under incomplete information rather than low-level operation, and that the priors agents lack are already implicit in human interaction traces. We therefore study scalable behavior cloning for browser agents via skill distillation, converting user interaction trajectories into compact natural-language skills that agents can read, retrieve, reuse, and compose directly. We further organize the distilled skills into a skill graph so that growth proceeds through consolidation rather than unbounded accumulation. This suggests that the scalability of browser agents may come less from manually designed tasks and more from the collective skills already expressed by internet users. Our project is available at: this https URL.
[NLP-7] DigitalCoach: Communication and Grounding Gaps in Human and Agent ic Computer Use Coaching
【速读】: 该论文旨在解决如何让智能代理(Agent)不仅能够自动化软件操作任务,还能有效指导人类用户学习使用计算机软件的问题。其核心挑战在于构建具备类人教学能力的协作式智能体,使其在实际交互中能像人类专家一样进行有效的教学。解决方案的关键在于构建一个名为DigitalCoach的多模态数据集,该数据集包含72个由人类专家对新手进行计算机使用辅导的对话会话,涵盖28.1小时的屏幕与输入事件记录,并包含22,752轮对话。利用该数据集,研究评估了当前最先进的大模型在教学行为上的表现,发现尽管模型生成的语句在语言上接近人类参考,但其教学方式存在显著缺陷:倾向于提供直接指令,却缺乏解释、错误诊断和知识检验问题;同时,模型生成的回应在视觉上下文感知方面表现不佳,导致教学内容与界面状态脱节。通过自动评估与交互式评估相结合,研究证实模型教练使学习者被动执行指令,难以实现深度认知参与。因此,该研究提出的核心突破在于通过高保真、多模态的真实教学数据,为开发具备主动性和情境感知能力的协作式计算机使用辅导代理奠定了基础。
链接: https://arxiv.org/abs/2606.31980
作者: Meng Chen,Anya Ji,Tsung-Han Wu,Tobias Maringgele,David M. Chan,Alane Suhr,Amy Pavel
机构: University of California, Berkeley (加州大学伯克利分校); Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Agents are increasingly capable of automating software tasks, but can they teach humans how to use software themselves? We introduce DigitalCoach, a multimodal dataset of 72 human expert-novice computer use coaching sessions consisting of 22,752 dialogue turns grounded in 28.1 hours of screen and input event recordings across five software applications. We use DigitalCoach to evaluate whether state-of-the-art models can teach humans how to use computers. Automated evaluation shows that models differ from humans in how they coach: models provide more direct instructions, but fewer explanations, error diagnoses, and knowledge-check questions. When we fix the coaching method, models produce utterances similar to human references yet poorly grounded in visual context. Interactive evaluation confirms that model coaches cause learners to passively follow instructions without deeper engagement and fall short in visual grounding. DigitalCoach lays a foundation for collaborative and proactive computer use coaching agents.
[NLP-8] Signed-Permutation Coordinate Transport for RMSNorm Transformers
【速读】: 该论文旨在解决现代大语言模型(LLM)工作流中跨检查点坐标索引对象(如引导向量、稀疏自编码器、Top-k神经元集合、归因列表及合并对齐等)的准确传递问题。核心挑战在于,此类坐标传递在未固定模型残差流(residual-stream)规范(gauge)前是病态定义的,而该规范具有架构依赖性:使用LayerNorm的模型其残差流具有置换对称性(permutation gauge Sd,至多全局符号翻转),而采用RMSNorm且通道增益非均匀的模型则具有带符号的置换对称性(signed-permutation gauge Bd=Sd⋉±1d)。因此,仅依赖置换对齐的方法对RMSNorm模型而言是不对称完备的。论文的关键解决方案是提出符号边际化匈牙利匹配(sign-marginalized Hungarian matching),并证明了原始有符号相关性匹配在坐标不相关时存在结构性的置换精度上限,该上限等于真实规范中正号分量的比例,而符号边际化可彻底消除此限制。进一步地,论文将坐标保持性传输而非函数级合并作为核心目标:沿同一基础微调轨迹组合保存检查点的局部Bd规范,可在1500步时恢复91.1%的跨运行坐标,显著优于端点匹配的60.3%,且该增益无法仅通过基线路由解释。恢复的Bd规范使原本因置换对齐失效的工具得以正常运作——例如,TinyLlama SAE重构的归一化均方误差(NMSE)从Sd下的1.08降至Bd下的0.004;Qwen情感引导保留95.8%效果,而Sd下仅17.2%;拒绝引导在Sd下甚至发生符号反转。此外,坐标保持型合并亦表现出一致行为。这一规范一致性同样适用于状态感知训练:基于Bd的AdamW状态符号传输可精确恢复原有训练轨迹,而仅依赖置换对齐的状态则会走向功能等价但路径不同的轨迹。最后,通过规范扫描审计验证,指数级可解释性声明仅在显式指定规范的前提下才具备可复现性。
链接: https://arxiv.org/abs/2606.31963
作者: John Sweeney
机构: Sideplane AI
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注: 31 pages, 2 figures, 26 tables
Abstract:Modern LLM workflows move coordinate-indexed objects across checkpoints: steering vectors, sparse autoencoders, top- k neuron sets, attribution lists, and merge alignments. This is only well posed after fixing the model’s residual-stream gauge, which we show is architecture-dependent: LayerNorm residual charts have permutation gauge S_d (up to a global sign flip), while RMSNorm charts with generic per-channel gain have signed-permutation gauge B_d = S_d \ltimes \pm 1^d . Permutation-only alignment is therefore symmetry-incomplete for RMSNorm models. We introduce sign-marginalized Hungarian matching and prove a sharp failure mode: with decorrelated coordinates, raw signed-correlation matching has a structural permutation-accuracy ceiling at the positive-sign fraction of the true gauge, which sign-marginalization removes. We then make coordinate-preserving transport, not function-level merging, the primary object: composing saved-checkpoint local B_d gauges along same-base fine-tuning trajectories recovers 91.1% of cross-run coordinates at 1500 steps versus 60.3% for endpoint matching, and the gain is not explained by merely routing through the base. The recovered gauge transfers tools that permutation-only alignment breaks: TinyLlama SAE reconstruction has NMSE 0.004 under B_d versus 1.08 under S_d ; Qwen sentiment steering preserves 95.8% of its effect versus 17.2%; refusal steering reverses sign under S_d ; coordinate-preserving merges behave the same way. The same covariance governs stateful training: signed transport of AdamW state preserves the resumed trajectory, while permutation-only state follows a different one from a functionally identical checkpoint. Finally, gauge-sweep audits show index-level interpretability claims are reproducible only relative to an explicit gauge.
[NLP-9] LuxEmo: Expressive Text-to-Speech Corpus for Luxembourgish
【速读】: 该论文旨在解决低资源语言(如卢森堡语)在语音技术研究中长期被忽视的问题,特别是缺乏高质量、带有情感标注的对话式语音数据集。针对这一问题,研究提出构建一个21小时的卢森堡语情感表达语音语料库LuxEmo,涵盖4种情感类别。其解决方案的关键在于设计了一套半自动语料库清洗流程,整合了语音活动检测(Voice Activity Detection, VAD)、降噪、语言识别、基于LuxASR的语音切分、自动情感预测、词汇线索分析以及针对性的人工审核,从而高效且准确地从广播内容中提取高质量语音数据。此外,研究还对五种不同类型的表达性文本转语音(Expressive TTS)系统进行了基准测试,涵盖基于德语的跨语言迁移、多语言卢森堡语支持、卢森堡语适应及非参数化韵律迁移等方法,并通过客观指标与人工评估相结合的方式全面评估性能,为低资源语言的语音合成研究提供了可复用的技术路径与数据基础。
链接: https://arxiv.org/abs/2606.31947
作者: Nina Hosseini-Kivanani,Sandipana Dowerah
机构: 未知
类目: Computation and Language (cs.CL)
备注: 7 pages, 4 figures, under review
Abstract:State-of-the-art speech datasets predominantly focus on widely spoken languages, often overlooking low-resource languages such as Luxembourgish, which remain underrepresented in speech technology research. In this work, we introduce LuxEmo, a 21-hour conversational expressive speech corpus for Luxembourgish with 4 emotion categories. LuxEmo is derived from Radio Télévision Luxembourg (RTL) youth broadcasts, using automated detection followed by human validation. We propose a semi-automatic curation workflow combining voice activity detection, denoising, language identification, LuxASR-based segmentation, automatic emotion prediction, lexical cues, and targeted human review. Additionally, we benchmark five expressive TTS systems covering German-based cross-lingual transfer, multilingual Luxembourgish support, Luxembourgish adaptation, and non-parametric prosody transfer. Performance is evaluated using both objective metrics and human evaluation.
[NLP-10] heory of Mind and Persuasion Beyond Conversation: Assessing the Capacity of LLM s to Induce Belief States via Planning and Action
【速读】: 该论文旨在解决当前大型语言模型(LLM)在代理化(agentic)和自主化应用场景中,传统基于被动问答的“心智理论”(ToM)评估范式无法充分衡量其非对话式社会推理能力的问题。现有评估方法侧重于模型通过语言交流影响他人信念的能力,而忽视了模型通过实际行动(如移动物体、引导角色行动)来主动塑造其他智能体信念状态的潜在能力。为此,本文提出一种新的评估框架——非对话规划心智理论(NCP-ToM),其核心在于考察模型能否在不依赖对话说服的前提下,通过物理动作设计实现特定的信念目标。研究采用新构建的NCP-ExploreToM框架,在600个任务实例中测试包括GPT-5、Gemini 2.5 Pro及Claude 4系列在内的前沿模型与人类参与者的表现。结果显示,GPT-5在代理场景下成功完成约80%的任务,并成为唯一超越人类表现的模型,但其鲁棒性仍不及人类。此外,所有模型与人类均在诱导真实信念状态的任务上表现优于虚假信念任务,这一发现为模型对齐(alignment)提供了积极信号。研究表明,生成式AI在非对话情境下的社会推理能力正在快速演进,亟需发展以代理行为为核心的新一代评估体系,以全面理解其安全性与社会适应性。
链接: https://arxiv.org/abs/2606.31916
作者: Ben Slater,Matteo G. Mecattaf,Lucy G. Cheke,John Burden,Winnie Street
机构: University of Cambridge(剑桥大学); Google(谷歌); Leverhulme Centre for the Future of Intelligence(利弗休姆未来智能中心); Prolific(普罗利费克)
类目: Computation and Language (cs.CL)
备注: 29 pages, 12 figures
Abstract:Theory of Mind (ToM) benchmarks for Large Language Models (LLMs) typically rely on passive question-answering formats, but the deployment of LLMs in increasingly agentic and autonomous forms demands new evaluations. In this paper we evaluate an agent’s ability to induce specific belief states in other agents by taking actions rather than using conversational persuasion, a capability we call Non-Conversational Planning ToM (NCP-ToM). NCP-ToM is likely to be essential for many agent use-cases, including within user-assistant interactions and pedagogical contexts, but may also present manipulation or misinformation risks. Using a novel framework, NCP-ExploreToM, we subvert the conventional task structure by providing models with a set of belief state goals and requiring them to move objects or direct characters into rooms to achieve their goals. We evaluated six frontier models, including GPT-5, Gemini 2.5 Pro and the Claude 4 series, and a cohort of human participants, across 600 task instances. GPT-5 was successful on approximately 80% of tasks in the agentic setting, and was the only model to outperform human participants on our task, but was still less robust than humans across contexts. We additionally found that all models, like humans, performed better on tasks inducing true belief states than false belief states, which is a positive signal for alignment efforts. These findings highlight emerging social-reasoning capabilities in LLMs for non-conversational task completion and underscore the necessity of agentic evaluations for understanding the safety and alignment of autonomous social agents.
[NLP-11] Review Residuals: Update-Conditioned Residual Gating for Transformers
【速读】: 该论文旨在解决传统残差连接(Residual Connection)在深层神经网络中因固定系数1而导致的更新不可靠性问题,即网络无法根据当前状态和候选更新的可信度动态调整信息传递。其核心解决方案是提出“审查残差”(Review Residuals),通过引入一个由可学习、输入依赖的门控机制(gate)来调节每层子模块的更新量,形式为:$ h_l = h_{l-1} + r_l \cdot u_l $,其中门控值 $ r_l = \sigma(W[\text{RMSNorm}(h_{l-1}), \text{RMSNorm}(u_l)]) $,该门控同时依赖于前一状态与候选更新,实现了对更新可靠性的动态评估。关键创新在于将门控机制显式地条件化于更新本身,区别于以往仅依赖状态或静态缩放的残差结构。研究发现:第一,采用凸型(Highway风格)门控会导致梯度消失,模型训练难以超过约20层;而采用加性、保持恒等性质的门控形式可在任意深度稳定训练;第二,在模型规模扩大时出现“规模涌现效应”——小模型(60M–1B)下无明显优势,但在590M及以上规模显著优于参数匹配的Highway门控和标准残差结构(p < 0.05),且性能提升随模型规模增大而增强,表明该机制在大规模模型中具有更强的适应性和优越性。
链接: https://arxiv.org/abs/2606.31859
作者: Kyle Kramer
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: 9 pages, 2 figures. Also on Zenodo: this https URL ; Code: this https URL
Abstract:Residual connections add every sublayer’s proposed update with a fixed coefficient of one; the network never evaluates whether an update is reliable before committing it. Drawing on the human-factors principle of independent verification, we introduce Review Residuals, which scale each update by a learned, input-dependent gate conditioned on both the current state and the proposed update: h_l = h_l-1 + r_l * u_l with r_l = sigmoid(W[RMSNorm(h_l-1), RMSNorm(u_l)]). Conditioning the gate on the update is the property that distinguishes it from prior gated and scaled residuals. We report two findings. First, a depth-stability result: a convex (Highway-style) form of the gate reintroduces vanishing gradients and fails to train beyond ~20 layers, whereas the additive, identity-preserving form trains stably at all depths we tested. Second, an emergence-with-scale result: trained from scratch across five sizes (60M-1B parameters, multi-seed), Review Residuals show no advantage at small scale but at 590M significantly outperform both a parameter-matched Highway gate and a parameter-matched standard residual (p0.05), with a larger advantage at 1B. The benefit grows with model size rather than shrinking.
[NLP-12] Explicit Fuzzy Logic in the Feed-Forward Layer: Self-Forgetting Quantifiers Discover Legible Grammatical-Licensing Detectors
【速读】: 该论文旨在解决生成式模型中前馈神经网络(FFN)子层缺乏可解释性的问题,即尽管注意力机制能够捕捉语义差异,但其后续的FFN层在计算过程上仍为“黑箱”,无法明确表达其内部逻辑结构。其核心解决方案是提出一种参数中立的、具备逻辑可解释性的前馈网络——负向能力前馈网络(NC-FFN),其中每个隐藏单元显式地执行基于sigmoid归一化[0,1]隶属度的模糊集合运算,包括交集(AB)与集合差(A(1−B)),后者引入了有界正向否定(“A但非B”),弥补了传统门控/双线性单元在否定表达上的缺失。该设计使模型在浅层即可实现高效的逻辑推理(如在N位奇偶校验任务中表现最优),并在大规模语言建模(125M参数,OpenWebText数据集)中达到与GELU基线相当的困惑度,且每个单元均携带明确的逻辑形式。然而,模型仍存在两个关键缺陷:二元逻辑操作局限于第一层并随训练退化,以及语法许可与量词使用等深层语法能力薄弱。为此,研究引入一个小型序列量词模块,包含软存在量词与软比例量词,结合每单元自学习遗忘率与粘滞初始化(sticky init),有效恢复初始阶段的语法能力(首轮训练即显著缩小后续差距),在LAMBADA任务上取得小幅领先,并使整个FFN结构具备可读性:其内部结构不仅可迁移至深层,且单位激活模式可被解读为语法许可检测器——每个单元识别特定许可词(如比较级、被动分词、否定极性词),并将其记忆传递以预测被许可的词汇(如than, by, nor)。这种可解释性虽受限于某一划分边界(完全布尔型的FFN会训练发散),但最终构建了一个参数中立、语言模型性能达标、且通过构造即具可解释语法机制的Transformer,不仅揭示了前馈层的表征内容,更阐明了其语法许可机制的内在运作方式。
链接: https://arxiv.org/abs/2606.31845
作者: Mark Oskin
机构: University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:A transformer’s feed-forward (FFN) sublayer materializes the distinctions attention gathers, yet gives no account of what it computes. In a parameter-neutral replacement, each hidden unit is an explicit fuzzy set operation on sigmoid-bounded [0,1] memberships: intersection AB and set-difference A(1-B), the latter a bounded positive negation (“A but not B”) that gated/bilinear units lack – a negation-capable FFN (NC-FFN). On N-bit parity they are the most parameter-efficient reasoning basis at shallow depth; at scale (125M, OpenWebText) NC-FFN ties the GELU baseline’s perplexity, every unit carrying explicit logical form. Two limits share one cause: two-operand logic localizes to layer 0 and erodes under training, and the one robust grammatical deficit concentrates in licensing and quantifiers, beyond within-token operators. We resolve both with a small block of sequence quantifiers: a soft existential and a soft proportion, each with a per-unit learned forgetting rate from a sticky init. This recovers the deficit at epoch one (halving the wider epoch-two gap), modestly leads on LAMBADA, and makes the FFN legible: the structure now holds and migrates into depth; the decay un-learns its stickiness (median half-life ~1.5 tokens; zero latch units); and at the semantic layers the units read, without dictionary learning, as grammatical licensing detectors: each fires on a licensor (a comparative, a passive participle, a negative-polarity item) and carries its memory forward to predict the licensed word (than, by, nor). This legibility is localized and free only up to a partition (a fully Boolean FFN diverges in training), but the result is a parameter-neutral, language-model-quality transformer with a readable, interpretable-by-construction grammatical mechanism – an account not just of what a feed-forward layer represents but how it licenses.
[NLP-13] CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield
【速读】: 该论文旨在解决大语言模型在计算资源受限场景下的高效训练与部署问题,核心挑战在于如何在保持模型性能的同时显著降低计算开销与参数规模。其解决方案的关键在于提出三种互补的优化技术:首先,通过选择性监督与逐标记效率机制(Selective Ground Truth Token Training, SGT),仅对约15%承载语义信息的输出标记施加监督信号,利用位置共享变换器权重中的梯度耦合效应(正向梯度耦合系数γ̄ = 0.72),使剩余85%未监督标记仍能通过辅助任务迁移获得显著性能提升,实现每单位监督标记4.5倍的训练效率增益;其次,采用深度压缩与递归恢复机制,将48层、10亿参数的变压器模型通过相邻层平均压缩至6层(2.27亿参数),并借助可学习的递归展开恢复表征能力,在有效34层递归结构下达到接近566M稠密模型的损失表现(2.934 vs. 2.926),实现2.5倍参数量缩减;最后,引入压缩专家融合机制(Mixture of Efficient Experts, MoEE),通过多标记预测融合多个压缩专家模型,在激活参数数量相当的前提下,进一步降低损失至2.789,优于任一单个压缩专家。上述方法均在自研的韩语基础模型CHERRY-1.8B上验证,强调了方法的有效性边界与实证范围。
链接: https://arxiv.org/abs/2606.31796
作者: Dohyeon Kwon,Youngjin Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 33 pages, 3 figures, 28 tables. Preprint. Figures are native TikZ/pgfplots. Evaluation is loss-based; downstream benchmarks (KMMLU, HAERAE, KoBEST, MMLU) and selection-control ablations (random-15%, top-loss-15%) to appear in a future version
Abstract:We study three complementary techniques for training compute-efficient language models. (1) Selective supervision and per-token efficiency. Selective Ground Truth Token Training (SGT) concentrates supervision on the ~15% of output tokens that carry semantic payload. Through positive gradient coupling in position-shared transformer weights – a token-level instance of auxiliary-task transfer – the remaining 85% of unsupervised tokens still improve substantially, giving a 4.5x per-supervised-token efficiency (at the step-100 eval optimum, ~67% of the full-sequence loss reduction is recovered from 15% of the supervision). We prove that this improvement on unsupervised tokens is guaranteed whenever the gradient coupling coefficient gamma-bar = 0.72 is positive (Theorem 1), and show the effect is a property of natural-language structure: it collapses on shuffled text. (2) Depth compression with recurrent recovery. A 48-layer, 1B-parameter transformer is compressed to 6 layers (227M) by averaging adjacent layers and restored through learned recurrent unrolling. With 34 effective recurrent layers it reaches a held-out loss of 2.934, within measurement noise of a 566M dense model at 2.926 – a 2.5x reduction in parameters. (3) Fusion of compressed experts. Assembling several compressed models as a Mixture of Efficient Experts (MoEE) with multi-token prediction improves over each single expert at comparable active parameters: a 2-expert MoEE reaches loss 2.789 versus 2.926 for the best single compressed model. We validate these techniques on CHERRY-1.8B, a Korean foundation model whose every trainable parameter derives from our own training runs. We are explicit throughout about the scope of the evidence (one model family, Korean data, loss-based metrics) and about which claims are established versus prospective. Comments: 33 pages, 3 figures, 28 tables. Preprint. Figures are native TikZ/pgfplots. Evaluation is loss-based; downstream benchmarks (KMMLU, HAERAE, KoBEST, MMLU) and selection-control ablations (random-15%, top-loss-15%) to appear in a future version Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.31796 [cs.CL] (or arXiv:2606.31796v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.31796 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Dohyeon Kwon [view email] [v1] Tue, 30 Jun 2026 15:14:38 UTC (95 KB) Full-text links: Access Paper: View a PDF of the paper titled CHERRY: Compressed Hierarchical Experts with Recurrent Representational Yield, by Dohyeon Kwon and 1 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.CL prev | next new | recent | 2026-06 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); We gratefully acknowledge support from our major funders, member institutions, , and all contributors. About Help Contact Subscribe Copyright Privacy Accessibility Operational Status (opens in new tab) Major funding support from
[NLP-14] SpikeLogBERT: Energy-Efficient Log Parsing Using Spiking Transformer Networks
【速读】: 该论文旨在解决现有神经网络方法在日志解析(log parsing)中因依赖密集矩阵运算而导致的高计算开销与高能耗问题。其解决方案的关键在于提出SpikeLogBERT,一种基于脉冲神经网络(Spiking Neural Network, SNN)的高效日志解析框架。该框架通过融合脉冲变压器(spiking transformer)架构与来自BERT教师模型的知识蒸馏,实现了基于脉冲信号的事件驱动计算,在保持语义表征能力的同时,利用稀疏脉冲激活显著减少推理过程中的有效操作次数,从而大幅降低能量消耗。实验结果表明,SpikeLogBERT在HDFS数据集上达到0.99997的解析准确率,且在标准45nm CMOS工艺假设下,理论能耗可降低高达62.6%。
链接: https://arxiv.org/abs/2606.31781
作者: Thuan Bui,Duong Do,Tung Vu,Duc-Tho Mai,Cong-Kha Pham
机构: Swinburne Vietnam, FPT University (富布特大学越南分校); Posts and Telecommunications Institute of Technology (邮电技术研究所); Academy of Cryptography Techniques (密码技术研究院); The University of Electro-Communications (电气通信大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Log parsing is a fundamental step in automated log analysis, transforming raw system logs into structured event templates for downstream tasks such as anomaly detection and system monitoring. Existing log parsing methods range from rule-based and clustering-based approaches to neural models that learn semantic representations from log messages. However, neural approaches typically rely on dense matrix multiplications, which can result in high computational cost and energy consumption. This paper presents SpikeLogBERT, a spiking neural network framework for energy-efficient log parsing. The proposed model integrates a spiking transformer architecture with knowledge distillation from a BERT teacher model, enabling spike-driven computation while preserving semantic representation capability. By leveraging sparse spike activations and event-driven processing, the number of active operations during inference can be significantly reduced. As an initial benchmark study, experiments on the HDFS dataset demonstrate that SpikeLogBERT outperforms ANN-based neural log parsing models with a parsing accuracy of 0.99997, while reducing estimated theoretical energy consumption by up to 62.6% under standard 45nm CMOS assumptions.
[NLP-15] Bridging the Gap Between Latent and Explicit Reasoning with Looped Transformers
【速读】: 该论文旨在解决生成式语言模型在大规模场景下,隐式思维链(Latent Chain-of-Thought, Latent CoT)方法性能显著落后于显式思维链(Explicit CoT)的问题,尤其是在模型参数量超过10亿后,两者性能差距随规模扩大而加剧。其核心挑战在于如何在不增加参数量的前提下,提升隐式推理的效率与表现力。解决方案的关键在于提出一种基于循环深度(Looped or Recurrent-depth)Transformer架构的新型框架LOTUS(Looped Transformers with parallel supervision on latents),通过在并行处理的K个潜在推理块上进行R轮迭代,并对每个潜在位置施加与显式CoT相同的交叉熵损失以监督黄金思维链步骤,从而实现高效且可解释的隐式推理。该方法不仅在30亿参数规模下首次实现了与显式CoT相当的性能,还将推理阶段延迟降低了2.5倍至6.9倍,同时通过将循环后潜在表示投影回基础语言模型头,成功恢复出原始推理步骤甚至发现其他有效中间步骤,验证了潜在空间的可解释性与思维链对齐性。消融实验进一步表明,循环结构和对黄金推理步骤的并行监督均为性能关键因素。
链接: https://arxiv.org/abs/2606.31779
作者: Ying Fan,Anej Svete,Kangwook Lee
机构: UW-Madison; Microsoft Research(微软研究院); ETH Zürich; KRAFTON; Ludo Robotics
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Language models typically reason via explicit chain-of-thought (CoT), generating intermediate steps token-by-token. Latent CoT offers an alternative: it performs multi-step reasoning in the model’s hidden states, replacing decoded tokens with continuous representations for greater efficiency. However, existing latent CoT methods underperform explicit CoT beyond 1B parameters, and the gap widens with scale. Looped, or recurrent-depth, Transformers, which reuse their weights to increase computation depth without adding parameters, are a natural fit for latent reasoning. We therefore ask whether looped Transformers can bridge this gap. We answer affirmatively with a simple recipe: a looped padded Transformer that processes K latent blocks in parallel for R iterations, with a cross-entropy loss on each latent position’s gold CoT-step token, similar to explicit CoT supervision. We instantiate it as LOTUS (Looped Transformers with parallel supervision on latents). LOTUS is, to our knowledge, the first latent-CoT method to bridge the gap to explicit CoT at the 3B scale, while cutting thought-phase latency by 2.5x-6.9x from compact math expressions to natural language. Projecting LOTUS’s post-loop latents through the base LM head recovers the gold reasoning steps and even surfaces alternative valid intermediate steps, evidence that its latent space is interpretable and CoT-aligned. Ablations confirm that both the looped backbone and the parallel supervision on gold CoT tokens are essential.
[NLP-16] STEB: Style Text Embedding Benchmark
【速读】: 该论文旨在解决风格嵌入(style embeddings)评估缺乏统一标准的问题,当前相关研究多依赖各自独立的任务与数据集,导致评估结果难以比较和复现。为此,作者提出了风格文本嵌入基准测试(Style Text Embedding Benchmark, STEB),这是一个涵盖7种语言、共计96个数据集的综合性开源基准,覆盖作者身份验证、作者检索、AI生成文本检测、语言特征探查等多种应用场景。其解决方案的关键在于构建一个标准化、可复现且全面的评估框架,以推动风格嵌入技术的系统性比较与进步。实验表明,尽管语义嵌入在语义任务中表现优异,但在风格相关任务中表现普遍不佳,且不存在在所有任务上均占优的通用风格嵌入模型。该工作已将代码库开源,以促进后续研究的开展。
链接: https://arxiv.org/abs/2606.31741
作者: Rafael Rivera Soto,Anna Wegmann,Cristina Aggazzotti
机构: Johns Hopkins University (约翰霍普金斯大学); Utrecht University (乌得勒支大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While semantic embeddings are rigorously evaluated on the Massive Text Embedding Benchmark, the evaluation of style embeddings remains fragmented, with each work relying on their own set of tasks and datasets. To bridge this gap, we introduce the Style Text Embedding Benchmark, a comprehensive open-source benchmark intended to standardize the evaluation of style embeddings. STEB encompasses 96 datasets across 7 languages, spanning applications such as authorship verification, authorship retrieval, AI-text detection, probing of linguistic features, and others. We find that semantic embeddings consistently fail in stylistic tasks, and that there is no style embedding that is universally superior across all tasks evaluated. We open-source the STEB code base at: this https URL.
[NLP-17] Adapting Foundation ASR Models to Dysarthric Speech: A Case Study
【速读】: 该论文旨在解决生成式语音识别(ASR)系统在发音障碍(dysarthric)语音上表现不佳的问题,从而限制了受影响使用者在日常交流中的实际应用。其核心解决方案是通过个性化微调(personalized fine-tuning)方法,将基础ASR模型适配至特定发音障碍用户的数据。研究基于Whisper模型,利用TEQST工具收集了92小时的朗读语音,并通过部署的移动端应用补充了8.8小时的用户修正数据。实验表明,仅需1.4小时的适应数据即可使词错误率(WER)从原始模型的较高水平降至15.8%,随着训练数据增加至22.5小时时进一步降低至10.7%,在使用全部可用数据(含用户修正)时达到最优的9.7%。相比之下,采用LoRA适配或Qwen3-ASR作为基础模型的表现较差。结果证明,针对个体用户的个性化微调可显著提升基础ASR模型在发音障碍语音上的识别性能,具备实际部署可行性。
链接: https://arxiv.org/abs/2606.31722
作者: Christian Huber,Laura Kernahan,Alexander Waibel
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Automatic speech recognition (ASR) systems often perform poorly in dysarthric speech, limiting their usefulness to affected speakers in everyday communication. This paper presents a personalized ASR system for a dysarthric speaker, built by adapting a foundation ASR model to speaker-specific data. Using the TEQST tool, we collected 92 hours of read speech and later added 8.8 hours of user corrections gathered through a deployed mobile application. Starting from Whisper, fine-tuning reduced word error rate to 15.8% with only 1.4 hours of adaptation data, reached 10.7% with 22.5 hours, and achieved the best result of 9.7% when using all available data including the corrections. Using LoRA adaptation and/or Qwen3-ASR as foundation model performed worse in this setting. The results show that personalized fine-tuning can make foundation ASR models substantially more effective for dysarthric speech and suitable for practical deployment.
[NLP-18] Seeing Is Not Sharing: Some Vision-Language Models Overestimate Common Ground in Asymmetric Dialogue SIGDIAL2026
【速读】: 该论文旨在解决协作对话中“共享感知”与“共享理解”之间的鸿沟问题,即对话参与者虽能共享对场景的视觉感知,但未必达成一致的理解。其核心挑战在于:如何通过对话中的语义接地(grounding)机制,准确区分哪些信息是潜在可共享的(could be shared),哪些是已实际达成共识的(has been shared)。解决方案的关键在于构建一个基于HCRC MapTask对话数据集的13,077条标注参考表达式的“解释匹配”(interpretation-matching)任务,并在系统控制的对话上下文和地图信息可访问性条件下评估视觉语言模型(VLMs)的表现。研究发现,提供真实地图图像虽提升了整体性能,但导致模型过度预测对话双方存在一致性;而仅用文本描述同一地图内容也引发类似偏差,非信息性图像则完全抑制对齐预测,表明该偏差源于任务相关的地图内容本身,而非视觉模态。此外,模型在未对齐情形下的准确性显著下降,且校准分析与参考链追踪显示,模型主要依赖地图上的静态指代线索,而非动态追踪对话历史中的接地过程。这一现象在Qwen3-VL-8B-Instruct模型中最为明显,并在其他四个来自两种架构家族的模型中不同程度存在,揭示出当前VLMs常将地图内容作为“潜在共享”的证据,混淆了“可能共享”与“已建立共知”的边界。
链接: https://arxiv.org/abs/2606.31719
作者: Nan Li,Albert Gatt,Massimo Poesio
机构: Utrecht University (乌得勒支大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 9 figures, 8 tables; accepted to SIGDIAL 2026
Abstract:In collaborative dialogue, shared perception does not guarantee shared interpretation. Mutual understanding must be established through interaction. We investigate whether vision-language models (VLMs) can distinguish what could be shared from what has been shared between dialogue participants through grounding. We formulate this as an interpretation-matching task on 13,077 annotated reference expressions from HCRC MapTask dialogues, and evaluate VLMs under systematically controlled manipulations of dialogue context and map-information access. Our results show that providing authentic map images improves overall performance but shifts models toward over-predicting alignment. Textual descriptions of the same map content reproduce this bias, while non-informative images suppress alignment predictions entirely, indicating that the bias is driven by task-relevant map content, not the visual channel. This improvement comes at the cost of degraded accuracy on non-aligned cases. Calibration analysis and reference-chain tracking further suggest that models rely on static referential cues on the maps rather than tracking how grounding unfolds through dialogue history. We observe these patterns most clearly in Qwen3-VL-8B-Instruct and, to varying degrees, in four additional models from two architecture families. In models that exhibit the bias, map content, whether presented visually or textually, is treated as evidence of mutual understanding, conflating potential with established common ground.
[NLP-19] Cross-lingual Relation Extraction with Large Language Models : Zero-Shot Few-Shot and Fine-Tuned Evaluation on Romanian
【速读】: 该论文旨在解决低资源语言(如罗马尼亚语)在关系抽取(Relation Extraction, RE)任务中因缺乏标注语料而面临的性能瓶颈问题。其核心挑战在于如何在无或极少人工标注数据的情况下,实现跨语言关系抽取的有效迁移。解决方案的关键在于结合大语言模型(Large Language Model, LLM)驱动的自动数据翻译与微调策略:通过基于LLM的翻译管道将英文SemEval-2010 Task 8基准数据集自动翻译为罗马尼亚语,并在此基础上评估Gemma 4 31B模型在零样本(zero-shot)、少样本(few-shot)及QLoRA微调三种配置下的表现。研究发现,尽管罗马尼亚语在仅使用提示(prompt-only)时相比英语存在3–5个百分点的性能下降,但采用QLoRA微调可使宏平均F1分数提升超过22个百分点,显著缩小跨语言差距(从3.3降至1.4个百分点)。值得注意的是,相较于31B参数规模的生成式模型,小型单语罗马尼亚语BERT(125M)和多语XLM-RoBERTa(278M)等编码器基线模型在罗马尼亚语上表现接近甚至媲美微调后的大型模型,且计算开销低得多。因此,研究结论指出,在计算资源受限的实际部署场景中,使用31B级大模型进行单一任务的关系抽取并不具备充分合理性,而高效的小型本地化模型更具应用价值。
链接: https://arxiv.org/abs/2606.31718
作者: Dragos-Mitrut Vasile,Elena-Simona Apostol,Stefan-Adrian Toma,Adrian Paschke,Ciprian-Octavian Truica
机构: University of Bucharest (布加勒斯特大学); National Institute for Research and Development in Informatics (罗马尼亚信息学研究与发展国家研究所); Fraunhofer Institute for Intelligent Analysis and Information Systems (弗劳恩霍夫智能分析与信息系统研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Relation extraction (RE) for low-resource languages is typically constrained by the lack of annotated corpora. We investigate the feasibility of cross-lingual RE for Romanian by combining automatic dataset translation with large language model (LLM) inference. We translate the SemEval-2010 Task 8 benchmark from English to Romanian using an LLM-based translation pipeline and evaluate Gemma 4 31B under zero-shot, few-shot, and QLoRA fine-tuned configurations, against four encoder baselines spanning 125M to 560M parameters: XLM- RoBERTa (base and large), Romanian BERT, and RoBERT- large. We assess two task formulations: relation classification with marked entities and end-to-end extraction. Our results show that Romanian incurs a 3 to 5 percentage point (pp) drop relative to English in prompt-only settings, that few-shot prompting provides marginal gains over zero-shot, and that QLoRA fine-tuning improves macro F1-Score by more than 22 percentage points in both languages while reducing the cross-lingual gap from 3.3 to 1.4pp. The encoder baselines come within 1-4pp of QLoRA Gemma on Romanian despite being 50-250 times smaller, with monolingual Romanian BERT at 125M parameters matching multilingual XLM-R at 278M. The case for using a 31B model for single-task RE on Romanian is therefore weak in deployment scenarios where compute matters. We release the translated dataset, evaluation code, and trained models.
[NLP-20] RCT: A Robot-Collected Touch-Vision-Language Dataset for Tactile Generalization
【速读】: 该论文旨在解决机器人在操作开放世界物体时,触觉表征难以泛化至未见材料的问题。其核心挑战在于现有触觉数据集缺乏对接触序列(contact sequence)和材料类别间独立性的严格划分,导致评估结果因训练与测试集之间存在近似重复的物理交互样本而产生偏差。为应对这一问题,研究提出RCT(Robotic Contact Tactile)数据集,该数据集通过机器人系统在122种工业参考材料(7类)上进行多位置、多传感器(DIGIT)的完整按压采集,共获得29,279个触觉帧,并以接触序列形式保存,确保可实现跨材料、类别、传感器、接触位置及接触序列的留出评估(held-out evaluation)。关键解决方案在于:1)显式保留接触序列结构,避免帧级随机划分带来的样本重叠;2)采用固定编码器下的对比学习策略,结合均匀采样单次按压以提升对比训练效果;3)通过在未见材料上的分类探针实验验证,表明基于RCT训练的嵌入表示具备更强的新材料泛化能力。实验表明,若不剔除接触序列重叠,触觉到文本的Recall@1下降达17.7个百分点;当材料在训练阶段也被留出时,平均Recall@1仅为25.1% ± 6.1%,凸显了新材料泛化作为机器人触觉感知的核心挑战。此外,公开的TVL/HCT划分方案揭示了当前评估范式中的根本缺陷——测试序列均出现在训练集中,导致仅靠原始像素最近邻即可实现98.3%的准确匹配。因此,本研究不仅提供了一个支持序列感知、可复现的留出评估基准,更推动了对生成式触觉感知中泛化能力的深入探索。
链接: https://arxiv.org/abs/2606.31694
作者: Jingbo He,Michael Färber,Roberto Calandra
机构: TU Dresden (德累斯顿工业大学); ScaDS.AI (德国萨克森数据科学中心); LASR Lab (感知与机器人实验室)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:For robots manipulating open-world objects, tactile representations must generalize to unseen materials. We introduce RCT (Robotic Contact Tactile), a robot-collected touch-vision-language dataset with 29,279 tactile frames from full robot presses on 122 industrial reference materials in 7 categories, recorded with three DIGIT sensors at multiple contact positions. RCT preserves each press as a contact sequence, enabling held-out evaluation across materials, categories, sensors, contact positions, and contact sequences. Frames from one press are strongly correlated: frame-random splits can place near-duplicate observations of the same physical interaction in both training and test. With the encoder held fixed, removing contact-sequence overlap reduces tactile-to-text Recall@1 by 17.7 percentage points. When materials are additionally held out at training time, performance drops sharply, leaving held-out-material Recall@1 at 25.1 +/- 6.1% averaged over three held-out draws. The public TVL/HCT split shows the same structure: every test contact sequence appears in training, and raw-pixel nearest neighbors recover the correct sequence in 98.3% of cases. Uniformly sampling a press improves contrastive training, and RCT-trained embeddings improve category probes on unseen materials. RCT makes contact-sequence-aware, held-out-material evaluation reproducible and exposes novel-material generalization as a central challenge for robotic tactile perception. The RCT dataset is open-sourced at this https URL
[NLP-21] Overview of the TalentCLEF 2026: Skill and Job Title Intelligence for Human Capital Management
【速读】: 该论文旨在解决人资管理领域中自然语言处理(Natural Language Processing, NLP)技术应用的评估瓶颈问题,具体聚焦于两个关键任务:一是基于上下文的职位-候选人匹配(Task A),即在英文和西班牙语环境下,从简历中识别并排序最适配特定职位空缺的候选人;二是职位-技能匹配与技能类型分类(Task B),即针对给定职位名称检索最相关的技能,并区分核心技能与情境性技能。其解决方案的关键在于构建一个多语言、跨任务的共享评估基准,通过标准化的数据集与评价机制,推动社区在人才匹配与技能识别方面的技术创新。该挑战吸引了113支注册团队,提交超过400项参赛结果,体现了研究界对人力资源管理中NLP技术评测框架日益增长的关注与参与。
链接: https://arxiv.org/abs/2606.31692
作者: Luis Gasco,Hermenegildo Fabregat,Laura García-Sardiña,Paula Estrella,Warre Veys,Casimiro Pío Carrino,Matthias De Lange,Daniel Deniz Cerpa,Álvaro Rodrigo,Jens-Joris Decorte,Rabih Zbib
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper presents an overview of the second edition of the TalentCLEF challenge, organized as a Lab at the Conference and Labs of the Evaluation Forum (CLEF) 2026. TalentCLEF is an initiative aimed at advancing Natural Language Processing research in Human Capital Management. The second edition of the challenge consisted of two tasks: Task A, contextualized job-person matching, focuses on identifying and ranking the most suitable candidates represented by their resumes for a given job vacancy in English and Spanish. Task B, job-skill matching with skill type classification, addresses retrieving the most relevant skills for a given job title in English and distinguishing between core and contextual skills. TalentCLEF attracted 113 registered teams and received more than 400 submissions in the two tasks, reflecting the growing interest of the research community in shared evaluation benchmarks for Human Capital Management. This paper describes the motivation and organization of the challenge, summarizes the datasets and evaluation settings, and reports the main results obtained by the participating teams.
[NLP-22] Moral Safety in LLM s: Exposing Performative Compliance with Puzzled Cues
【速读】: 该论文旨在解决当前大型语言模型在医疗、法律及招聘等高道德影响场景中,其伦理行为是否具有真实可靠性的问题。现有公平性评估方法存在严重偏差,即模型在显式标注人口统计学身份(如“女性”或“少数族裔”)时表现出较高的公平性,但在身份需通过上下文推断时,其公平性显著下降。这种现象被称为表演性合规(performative compliance)——模型仅在呈现符合公平性评估形式时表现公平,而当提示线索减弱时,其行为便显露出偏见。论文提出的关键解决方案是引入提示变异方法(cue-variation methodology),在保持道德困境与身份信息不变的前提下,仅改变身份信息的表达方式(如显式标签或隐含推断),从而揭示模型对提示线索的依赖性。实验表明,隐藏显式标签会使有害决策增加4.4个百分点以上,并导致模型安全排名发生改变,且该效应在模型正确推断身份后依然存在,排除了误判的可能性。为此,论文进一步提出提示可见性差距(Cue Visibility Gap),一个模型无关的鲁棒性度量指标,可嵌入任意现有公平性基准测试中,以区分真正的道德稳健性与表面合规。研究强调,忽略提示变异的评估仅测量表层合规,而非真正的道德鲁棒性,因此不应作为高风险场景下模型部署的依据。
链接: https://arxiv.org/abs/2606.31644
作者: Mohammadamin Shafiei,Shuyue Stella Li,Yulia Tsvetkov
机构: University of Milan (米兰大学); University of Washington (华盛顿大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:As large language models take on morally consequential roles in healthcare, legal, and hiring contexts, we need to examine whether their ethical behaviors are genuine or superficial. We show that current fairness evaluations substantially overestimate moral safety. Models appear fair when demographic identity is stated as an explicit label, yet become measurably less fair when the same identity must be inferred. We term this failure \emphperformative compliance, where a model is fair when the presentation resembles a fairness evaluation and less fair as that cue weakens. We introduce a cue-variation methodology that holds the moral dilemma and the demographic identity fixed and varies only how that identity is conveyed. Hiding the explicit label raises harmful decisions by +4.4 ~pp and changes model safety rankings, and the shift persists when models correctly infer the demographic, ruling out attribution error. We propose the \textbfCue Visibility Gap, a model-agnostic robustness metric that can be added to any existing fairness benchmark to separate genuine from performative moral safety. Fairness evaluations that omit cue variation measure surface compliance, not moral robustness, and should not ground deployment decisions in high-stakes settings.
[NLP-23] one-Conditioned Curriculum Learning for Low-Resource Bantu Speech Recognition
【速读】: 该论文旨在解决南方班图语(Southern Bantu languages)在自动语音识别(ASR)领域中缺乏高效基础模型的问题,当前主流的零样本ASR模型在这些语言上的词错误率(WER)超过100%,严重制约其在教育与公共服务中的实际应用。针对这一挑战,研究提出了一种基于声调条件的课程学习框架(tone conditioned curriculum framework),其核心创新在于融合混合难度评分机制、由声调统计信息驱动的门控适配器(gated adapters)以及分阶段课程训练策略。通过在社区构建的语料库上训练,并在NCHLT数据集上进行跨数据集迁移测试,验证了模型的鲁棒性。实验结果表明,模型架构与语言特性之间存在显著交互效应:W2V-BERT在祖鲁语等祖鲁语族语言上优于Whisper,误差降低3至4 WER点;而Whisper在索托-茨瓦纳语族语言上表现更优。结合声调条件的W2V-BERT在所有数据集上实现了28.41%的平均WER,Xitsonga语言迁移测试中达到23.79%的优异表现。研究进一步揭示,不存在适用于全部六种语言的通用模型,因此实际部署应采用“按语言选择模型”并结合多语料库验证的策略。
链接: https://arxiv.org/abs/2606.31642
作者: Kesego Mokgosi,Vukosi Marivate,Sitwala Mundia,Unarine Netshifhefhe,Tsholofelo Hope Mogale,Thapelo Sindane
机构: Technological University Dublin (都柏林理工学院); University of Pretoria (普利托利亚大学); Lelapa AI; Meta (Meta); Data Science for Social Impact (数据科学与社会影响); African Institute of Data Science and AI (非洲数据科学与人工智能研究所); AI4D Africa Program (非洲人工智能4发展项目); International Development Research Centre (国际发展研究中心); Foreign, Commonwealth and Development Office (英国外交、联邦及发展事务部); Research Ireland Centre for Research Training in Digitally Enhanced Reality (研究爱尔兰中心数字增强现实培训); ADAPT Research Ireland Centre for AI-Driven Digital Content Technology (爱尔兰适应研究中心人工智能驱动数字内容技术)
类目: Computation and Language (cs.CL)
备注:
Abstract:Southern Bantu languages are spoken by over 80 million people, yet current foundation ASR models still produce zero-shot WER above 100%, which limits practical use in education and public services. We addressed this gap with a tone conditioned curriculum framework for 6 Southern Bantu languages that combined hybrid difficulty scoring, gated adapters driven by tonal statistics and staged curriculum training. We trained on a community corpus and tested transfer to NCHLT to measure robustness beyond matched evaluation. Results revealed clear interactions between architecture and language, with W2V-BERT outperforming Whisper on Nguni languages by 3 to 4 WER points whilst Whisper performed better on Sotho-Tswana languages. W2V-BERT with tone conditioning reached 28.41% average WER across datasets and 23.79% on Xitsonga transfer. No single model suited all 6 languages, so deployment should pair model selection per language with validation across corpora.
[NLP-24] CLExEval: A Human-in-the-Loop Framework for Qualitative Evaluation of LLM Clinical Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在医疗临床推理评估中存在“评价幻觉”(evaluation illusion)的问题,即模型生成的流畅且结构化的解释可能看似合理,但其最终诊断却错误,导致自动化评估无法真实反映模型的临床可靠性。其解决方案的关键在于提出一种人机协同的评估框架CLExEval,通过逐步信息遮蔽(progressive information masking)模拟临床决策中的信息稀缺场景,并结合5,600条专家医师标注与200条来自40个罕见病案例的临床推理轨迹进行系统分析。研究识别出三大核心失败模式:(i)冗长性偏差(verbosity bias),在信息受限时诊断准确率从95.0%骤降至32.5%;(ii)隐性知识悖论(hidden knowledge paradox),专业模型虽具备高达92.5%的诊断潜力,但在冗长上下文中难以稳定调用该知识;(iii)推理-输出不一致(reasoning-to-output mismatch),68.6%的正确诊断出现在推理过程但未被纳入最终结论。此外,对LLM作为裁判(LLM-as-a-Judge)范式的评估表明,即使经过人工验证的错误样本,GPT-4o-mini仍批准了47.9%的临床错误输出,而HuatuoGPT-o1虽能识别所有有效错误,但表现出正向自我偏好偏差。这些发现揭示了脱离专家基准验证的纯自动化评估可能严重高估模型的临床可靠性,强调了专家参与在构建可信医疗AI评估体系中的关键作用。
链接: https://arxiv.org/abs/2606.31608
作者: Ajmal M.,Abin Roy,Afthab Salam Kanniyan,Jawadh Abdul Kabeer,Jerin James,Preslav Nakov,Zhuohan Xie
机构: MBZUAI( MBZUAI); IIT Madras(印度理工学院马德拉斯分校); Calicut Medical College(喀拉拉医学学院)
类目: Computation and Language (cs.CL)
备注: 21 pages, 12 figures
Abstract:Large Language Models (LLMs) achieve strong results on many medical benchmarks, but their clinical reasoning remains difficult to evaluate reliably. A central risk is an evaluation illusion: fluent and well-structured explanations can appear clinically convincing even when the final diagnosis is incorrect. We introduce CLExEval, a human-in-the-loop framework for evaluating LLM clinical reasoning under progressive information masking. CLExEval combines 5,600 expert-physician annotations with 200 clinical reasoning traces derived from 40 rare diagnostic cases. Our analysis identifies three recurring failure patterns: (i) verbosity bias, where GPT-4o-mini’s diagnostic accuracy drops from 95.0% to 32.5% under information scarcity; (ii) a hidden knowledge paradox, where a specialist model reaches 92.5% maximum diagnostic potential but fails to retrieve that knowledge reliably in verbose contexts; and (iii) a 68.6% reasoning-to-output mismatch, where correct diagnoses appear in reasoning traces but are not reflected in final answers. We further evaluate the LLM-as-a-Judge paradigm on a human-verified failure set (n = 142). GPT-4o-mini approved 47.9% of clinically incorrect outputs, while HuatuoGPT-o1 approved all validly scored failures and showed a positive self-preference bias. These results suggest that standalone automated clinical evaluations can substantially overestimate clinical reliability without expert-grounded validation.
[NLP-25] Robust Text Watermarking for Large Language Models via Dual Semantic Embeddings
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本在经历语义改写(如改写、翻译)后,传统语义水印技术易失效的问题。现有水印方法在面对自然语言的语义等价变换时,其鲁棒性显著下降,难以保证水印的可检测性与持久性。为此,本文提出双嵌入水印(Dual-Embedding Watermarking, DEW)方案,其核心创新在于同时利用词元级(token-level)与上下文级(contextual)嵌入,通过信号处理方法在向量空间中进行代数运算,构建对语义偏移具有渐进退化特性的水印信号。关键在于:采用基于密钥生成的伪随机矩阵对嵌入向量进行投影以实现水印混淆,并结合底层代数结构导出的相关分布,用于统计检测与基准测试。实验结果表明,DEW 在多种主流大语言模型上均能有效提升改写后的水印可检测性,且在翻译场景下仍保持可检测性,即便先前的语义水印已严重退化。这一特性使DEW成为保障生成内容溯源性与推动负责任人工智能部署的实用且鲁棒的解决方案。
链接: https://arxiv.org/abs/2606.31602
作者: Jonas Schäfer,Cezary Pilaszewicz,Gerhard Wunder
机构: Freie Universität Berlin (柏林自由大学)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Preprint. 22 pages, 9 tables, 1 figure
Abstract:This work presents Dual-Embedding Watermarking (DEW), a semantic watermarking scheme for large language models (LLMs) that leverages contextual and token-level embeddings to enhance robustness against paraphrasing and translation. DEW utilizes a signal-processing methodology, applying algebraic vector-space operations to \mboxtoken and context embeddings to derive a watermark signal that degrades gracefully under semantic shifts. The method obfuscates the watermark by projecting embedding vectors through pseudo-random matrices seeded with a secret key. Relevant distributions derived from the underlying algebra are evaluated and employed for statistical testing and benchmarking of DEW. Experimental results across multiple LLMs indicate that DEW improves post-paraphrase detection while maintaining competitive text quality, and remains detectable after translation, even when prior semantic watermarks degrade significantly. These findings position DEW as a practical and robust solution for safeguarding LLM-generated text and addressing critical issues in responsible AI deployment.
[NLP-26] AutoTrainess: Teaching Language Models to Improve Language Models Autonomously
【速读】: 该论文旨在解决大语言模型(Language Model, LM)后训练过程高度依赖人工干预的问题,尤其在面对复杂、长周期的软件工程等任务时,现有自主后训练方法难以有效执行多轮规划、数据构建、稳定训练、检查点评估及实验状态维护等关键操作。其核心挑战在于,仅依靠原始命令行界面(CLI)提供的模糊动作空间,难以实现可靠且高效的自动化训练流程。解决方案的关键是提出AutoTrainess——一个将人类过往经验显式封装为可复用的工作流、规则与执行约束的生成式智能体(Generative AI Agent),通过提供结构化的代理-计算机接口(agent-computer interfaces),将规划、数据准备、训练、评估与日志记录等操作模块化并规范化。该设计使智能体能够在多轮交互中保持实验状态一致性,并显著提升训练行为的有效性与鲁棒性。在PostTrainBench基准测试中,使用GPT-5.4(Codex)的AutoTrainess平均得分达26.94,优于纯CLI基线(23.21),并在跨模型和不同工具链场景下展现出良好泛化能力,如将DeepSeek-V4-Flash(OpenCode)的表现从12.13提升至19.58。
链接: https://arxiv.org/abs/2606.31551
作者: Zhaojian Yu,Penghao Yin,Shuzheng Gao,Shilin He,Kai Cai,Xiao-Ping Zhang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that autonomous post-training is not just a coding problem: it requires the agent to repeatedly plan iterations, construct benchmark-aligned data, run stable training jobs, evaluate checkpoints, and preserve experiment state across many hours of interaction. We present AutoTrainess, a LM agent that exposes these operations as a repository of agent-computer interfaces for planning, data preparation, training, evaluation, and logging. Rather than leaving the agent to operate in a raw CLI environment with an underspecified action space, AutoTrainess externalizes prior human experience as explicit workflows, rules, and execution constraints that guide the agent toward effective and reliable training behavior. On PostTrainBench, AutoTrainess consistently outperforms CLI-only baselines, achieving 26.94 average score with GPT-5.4 (Codex) versus 23.21 for CLI-only. It also generalizes across models and harnesses, improving DeepSeek-V4-Flash (OpenCode) from 12.13 to 19.58.
[NLP-27] Modality-Driven Search with Holistic Trace Judging for ARC-AGI-2
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在抽象推理任务中生成流畅但可能错误的推理链(reasoning traces)所导致的“高置信度错误”问题,核心挑战从单纯的推理生成转向多候选解的优选。其解决方案的关键在于两个创新性原则:一是将不同模态(文本、图像、代码)视为独立的搜索算子,通过跨模态并行生成多样化推理路径以提升假设多样性;二是采用上下文保持的全局判别机制,即在单一长上下文提示中联合比较所有候选推理链,使判别模型能够基于完整上下文进行综合评估。相比传统的自一致性或多数投票策略,该方法能有效识别并恢复少数正确但被主流模型忽略的假设,在ARC-AGI-2基准测试中取得显著性能突破——在半私有评估集上达到72.9%准确率(每任务成本38.99美元),超越当时最强的单体模型GPT-5.2 Pro(54.2%)和Gemini 3 Pro(54.0%)达18.7个百分点;在公开评估集上达到76.1%准确率(每任务成本19.69美元)。研究还系统揭示了预设提示模板与迭代精炼会抑制假设多样性、降低整体性能的负面效应,并开源全部代码以支持可复现性。
链接: https://arxiv.org/abs/2606.31543
作者: Johan Land
机构: 独立研究者(Independent Researcher)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 37 pages, 4 figures; source code available at this https URL
Abstract:Large language models can produce fluent, internally coherent reasoning traces for abstract reasoning tasks while still being confidently wrong - making selection among candidates, not just generation, the central challenge. I present a solver for ARC-AGI-2, a few-shot visual reasoning benchmark, built around two principles: (i) treating reasoning modalities as search operators, generating diverse candidates independently across text, image, and code channels, and (ii) context-preserving holistic judging, in which a judge model jointly compares all candidate reasoning traces within a single long-context prompt. Unlike self-consistency or majority voting, this approach reliably recovers correct minority hypotheses on tasks where the modal answer is wrong. On the ARC Prize semi-private evaluation set, the solver achieves 72.9 percent at USD 38.99 per task - the highest score on the verified leaderboard at the time of writing, exceeding the best standalone frontier models, GPT-5.2 Pro at 54.2 percent and Gemini 3 Pro at 54.0 percent, by +18.7 percentage points. On the public evaluation set, it achieves 76.1 percent at USD 19.69 per task. I release the full source code and document extensive negative results, including the finding that prescriptive prompting templates and iterative refinement systematically reduce hypothesis diversity and degrade performance. Comments: 37 pages, 4 figures; source code available at this https URL Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2606.31543 [cs.AI] (or arXiv:2606.31543v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.31543 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-28] FinPersona-Bench: A Benchmark for Longitudinal Psychometric Stability of Autonomous Financial Agents DATE
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长期金融应用中因市场环境持续演化而导致行为指令(如“保护资本”或“避免投机性投资”)逐渐失效的问题,这一现象被作者正式定义为指令显著性衰减(Mandate Salience Decay, MSD)。其核心挑战在于:尽管初始设定的行为指令旨在贯穿整个部署周期以约束模型决策,但随着时间推移与市场情境累积,这些指令的影响力逐步减弱,导致模型行为偏离预期。为客观量化该现象,论文提出FinPersona-Bench——一个合成市场仿真基准,通过将可观测价格与隐含基本面价值解耦,实现对三种典型失效模式的可验证评估:在平静市场中无信号交易、市场崩盘时恐慌抛售、以及投机泡沫期间忽视基本面价值。实验评估了18个前沿及开源的LLM,分别赋予从严格保本到激进增长的不同行为特征,结果表明MSD随时间累积且具有模型依赖性;在崩盘场景中,接受定期指令重置(re-grounding)的代理与静态代理之间的行为偏差在模拟周期末期相比初期扩大了4.4倍。值得注意的是,指令重置的效果并非普适有效:它对保守型代理在低信号市场中始终有益,但在相同环境下反而加剧激进型代理的非理性行为。因此,论文的关键解决方案是提出一种基于代理行为特征与市场状态的选择性、指令感知型重置机制,以实现长周期金融任务中可靠、动态的行为控制。
链接: https://arxiv.org/abs/2606.31522
作者: Muhammad Usman Safder,Ayesha Gull,Rania Elbadry,Fan Zhang,Yankai Chen,Xueqing Peng, Xue (Steve)Liu,Preslav Nakov,Zhuohan Xie
机构: MBZUAI( MBZUAI); The University of Tokyo(东京大学); McGill University(麦吉尔大学); The Fin AI(金融人工智能公司); Kyoto University(京都大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 29 pages, includes figures and tables; formalizes Mandate Salience Decay and introduces FinPersona-Bench
Abstract:Large Language Models (LLMs) are increasingly deployed as autonomous financial agents initialized with explicit behavioral mandates such as “preserve capital” or “avoid speculative bets” that are meant to govern every decision throughout deployment. In practice, however, as market context accumulates over long horizons, these mandates gradually lose their behavioral influence, a phenomenon we formalize as Mandate Salience Decay (MSD). To measure MSD objectively, we introduce FinPersona-Bench, a simulation benchmark in which a synthetic market decouples observable price from hidden fundamental value, enabling falsifiable evaluation across three failure modes: trading without signal in calm markets, panic-selling during crashes, and ignoring fundamental value during speculative bubbles. Evaluating 18 leading frontier and open-source LLMs, each assigned one of three behavioral profiles ranging from strict capital preservation to aggressive growth, shows that MSD compounds over time and is model-dependent. In crash scenarios, the behavioral gap between static agents and those receiving periodic mandate re-grounding grows 4.4x from the first to the final quarter of the simulation. The effects of mandate re-grounding are not uniformly positive: it consistently helps conservative agents in low-signal markets but actively worsens behavior for aggressive agents in the same setting. These findings suggest that reliable long-horizon deployment requires selective, mandate-aware re-grounding based on agent profile and market regime.
[NLP-29] RaBitQCache: Rotated Binary Quantization for KVCache in Long Context LLM Inference ICML26
【速读】: 该论文旨在解决长上下文大语言模型推理中因大规模键值(Key-Value, KV)缓存导致的性能瓶颈问题。现有稀疏注意力方法通常受限于静态固定预算(如Top-k)的检索策略,或依赖计算开销大且存在偏差的代理评分机制。为此,论文提出RaBitQCache,一种新型稀疏注意力框架,其核心创新在于采用随机旋转二值量化(randomized rotated binary quantization)与高吞吐的二进制INT4算术运算,以高效估算注意力权重。该方法设计的代理评分作为无偏估计量,并具备理论保证的误差界,从而支持自适应的Top-p检索,能够根据实际注意力稀疏度动态调整令牌预算。此外,系统还引入面向硬件的异步流水线与懒更新机制,有效缓解掩码操作带来的开销。实验结果表明,相较于当前最优基线,RaBitQCache在显著加速推理、降低内存输入/输出开销的同时,仍能保持生成质量。
链接: https://arxiv.org/abs/2606.31519
作者: Wenhao Li,Jinhao Dong,Hailin Zhang,Wenhang Shi,Wei Lu,Xiaoyong Du
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accept by ICML 26
Abstract:Long-context Large Language Model inference is severely bottlenecked by the massive Key-Value (KV) cache, yet existing sparse attention methods often suffer from static fixed-budget (Top-k) retrieval or rely on proxy scores that are computationally expensive and biased. To address these limitations, we propose RaBitQCache, a novel sparse attention framework that utilizes randomized rotated binary quantization and high-throughput binary-INT4 arithmetic to efficiently estimate attention weights. Our proxy score serves as an unbiased estimator with a proven error bound, enabling adaptive Top-p retrieval that dynamically adjusts the token budget based on actual attention sparsity. We further implement a hardware-aware system with asynchronous pipelining and lazy updates to mask overhead. Evaluations demonstrate that RaBitQCache significantly accelerates inference and reduces memory I/O while preserving generation quality compared to state-of-the-art baselines. Code is available at this https URL.
[NLP-30] Falsification Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models
【速读】: 该论文旨在解决在无法重新训练模型的部署场景中,小型冻结代码模型如何有效修复失败程序这一核心问题。传统做法通常将模型对自身失败输出的重复尝试视为“重试机制”,但该研究从波普尔主义(Popperian)视角出发,提出生成程序本质上是一种可证伪的假说(conjecture),而执行测试失败则构成相对于特定验证器的可执行反例(executable counterexample)。因此,反馈的价值不应归因于对失败代码的重复暴露,而应取决于该假说是否能接受外部、可执行的批判性检验。其解决方案的关键在于构建一种基于“假阳性控制”的可证伪性测量框架:通过包分解(packet decomposition)、假阳性镜像(placebo mirroring)、匹配预算的不一致对测试(matched-budget discordant-pair tests)、新生成确认(fresh-generation confirmation)及可执行审计(executable audits)等方法,使模型的程序生成行为与研究者关于“反馈内容有效”的主张均具备可证伪性。实验在六个HumanEval+/MBPP+任务单元中,使用三款0.5B–1.5B参数的冻结模型进行评估,共分析290个无解任务单元,生成7,000次新程序并进行1,400次预注册复核。结果显示,盲采样比单纯重试提升18个成功解锁(p=0.0021),代码+事实信息相比仅代码提升18个(p=0.00042),相比通用符号型假阳性提升15个(p=0.0041),而仅指令信息无显著差异(p=0.36)。外部控制器复核结果表明,代码+事实与盲采样表现相当(各26次解锁),且与内容无关的形状匹配假阳性无异。这表明,在此设定下,可证伪性机制的作用并非源于词汇或自我批判,而是通过与外部可执行反例进行对比实现的。
链接: https://arxiv.org/abs/2606.31511
作者: Mehmet Iscan
机构: PythaLab, Yıldız Technical University (伊斯坦布尔理工大学), Istanbul, Turkey
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 39 pages, 5 figures, 14 tables
Abstract:In deployment settings where retraining is infeasible, small frozen code models are routinely asked to repair a failed program after seeing their own failing output, usually treated as a retry mechanism. From a Popperian view, a generated program is a conjecture and a test-execution violation is an oracle-relative, executable counterexample, so feedback’s value should be attributed not to re-exposure to failing code but to whether the conjecture is opened to external, executable criticism. As the third stage of a falsification-centered measurement program, this study builds a placebo-controlled instrument that decomposes the feedback packet against a blind-resampling baseline at matched output-generation budget and against content-free, shape-matched placebos. The contribution is not a new repair algorithm but a reflexive methodology (packet decomposition, placebo mirroring, matched-budget discordant-pair tests, fresh-generation confirmation, executable audits) that makes both the model’s program conjecture and the researcher’s “feedback content works” claim falsifiable. Across six HumanEval+/MBPP+ cells with three 0.5B-1.5B frozen models, 290 dead task-cell units (no best-of-8 candidate passing the public tier) were evaluated; the main run produced 7,000 fresh generations and a preregistered follow-up 1,400 more. Blind resampling exceeded bare-code retry by +18 net unlocks (25/7, Holm p=0.0021). Code-plus-facts recovered +18 over bare code (21/3, p=0.00042) and +15 over a generic-bullet placebo (p=0.0041). An instruction-only effect was not distinguishable (+3, p=0.36). Code-plus-facts and blind resampling tied at 26 unlocks each (not equivalence). Six external-controller follow-ups tied a content-free shape placebo. In this regime, falsification helped not as vocabulary or self-critique, but as comparison with external, executable counterexamples.
[NLP-31] Building an ASR Solution for Training and Assessing Childrens Reading
【速读】: 该论文旨在解决非洲语言(以巴马拉语为例)儿童阅读自动语音识别(Automatic Speech Recognition, ASR)系统发展滞后的问题,尤其针对可重复的读写能力评估需求。其核心挑战在于缺乏高质量、专为儿童设计的语音数据集与适配的语音识别模型。解决方案的关键在于构建一个端到端的开放系统:通过移动应用在真实课堂环境中采集60名儿童共计55小时的原始朗读语音数据,进而建立首个公开可用的巴马拉语儿童阅读评估基准。在此基础上,提出一种针对巴马拉语优化的Fast-Conformer架构——Soloni,结合时间对齐解码器(TDT)与连接时序分类解码器(CTC),并通过微调实现性能突破。实验表明,最优的Soloni模型将词错误率(WER)从0.42降至0.22,字符错误率(CER)从0.15降至0.08,显著优于紧凑型卷积架构QuartzNet。此外,研究发现重复朗读对不同模型具有异质性影响:对QuartzNet带来显著提升,而对Soloni仅产生边际增益;同时,频谱增强(SpecAugment)在不超越未增强最佳配置的前提下有效调控训练过程。进一步的分层分析揭示,10岁以下儿童是残余错误的主要来源,提示需针对性收集更年幼群体的数据。最终,十次课堂试点验证了该应用在实际教学环境中的持续可用性,支持其推广潜力。
链接: https://arxiv.org/abs/2606.31508
作者: Yacouba Diarra,Nouhoum Souleymane Coulibaly,Mamadou Dembele,Aymane Dembele,Michael Leventhal
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages, 2 figures
Abstract:Automatic speech recognition for children’s reading remains underdeveloped for most African languages, including Bambara, despite its potential value for reproducible literacy assessment. We present an open-source system for assessing children’s reading in Bambara, developed through an end-to-end process linking field data collection, benchmark construction, model adaptation, a reading application, and classroom validation. A mobile collection and assessment app was used to collect 55 hours of raw reading speech from 60 children, from which we construct a public benchmark for Bambara child-reading assessment. Fine-tuning experiments compare Soloni, a Bambara-adapted Fast-Conformer ASR framework with TDT and CTC decoders, with QuartzNet, a compact convolutional ASR architecture. The best Soloni model reduces WER from 0.42 to 0.22 and CER from 0.15 to 0.08, substantially outperforming QuartzNet on the isolated benchmark. The experiments further show that repeated readings of the same texts provide architecture-dependent benefits: they substantially improve QuartzNet but add only marginal gains for Soloni, while SpecAugment regulates training without exceeding the best unaugmented configuration. Disaggregated analysis identifies children under 10 as the main source of residual errors, motivating targeted collection from younger readers. Ten classroom trials supported continued use of the application.
[NLP-32] Fork-Think with Confidence
【速读】: 该论文旨在解决生成式人工智能(Generative AI)在复杂推理任务中因采用“先思考后决策”(think-first-then-decide)范式导致的过度生成(overgeneration)问题,该范式需先采样多条推理路径再进行剪枝或终止,造成大量计算资源浪费。其核心解决方案是提出一种“先决策后思考”(decide-first-then-think)的新范式——Fork-think with confidence,其关键在于:首先基于单条种子路径中模型输出的置信度动态识别具有潜在价值的分叉点(forking points),仅在这些关键位置触发并行推理,从而高效生成多个后续路径,并通过聚合策略获得最终答案。实验表明,Fork-think 在三个主流大模型和三个推理基准上,可将令牌消耗降低高达30%、运行时间减少57%,同时保持与现有并行思考方法相当甚至更优的性能表现。深入分析揭示,该方法能有效识别对下游任务有意义的分叉点,且在较晚位置进行采样可显著提升生成质量。此外,通过与早停机制和加权投票等现有技术结合,Fork-think 可进一步逼近当前最优方法,且无需预热或离线训练,验证了预设分叉点作为高效大模型推理新方向的可行性与潜力。
链接: https://arxiv.org/abs/2606.31484
作者: Zena Al-Khalili,Rafi Hakim,Dietrich Klakow,Ji-Ung Lee
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Parallel thinking has enjoyed great success for boosting LLM performance on reasoning tasks without the need for any re-training. However, existing methods follow a think-first-then-decide paradigm, i.e., they first sample multiple reasoning paths, which inevitably leads to overgeneration, then prune or stop unnecessary paths to compensate. In contrast, decide-first-then-think, i.e., first identifying points that are likely to lead to desirable generations, has been underexplored so far. Following this paradigm, we propose Fork-think with confidence, that first identifies forking points using model confidence in a single seeding path, then triggers thinking, sampling multiple continuations and aggregating them for the final response. Our experiments across three models and three reasoning benchmarks show that Fork-think reduces the token consumption by up to 30% and run-time by up to 57%, while performing comparable to or better than parallel thinking. Our analysis reveals that Fork-think is able to identify forking points that are meaningful with respect to the downstream task and that sampling at later positions can lead to substantially better generations. Finally, we demonstrate how combining Fork-think with existing mechanisms such as early stopping and weighted voting can further boost the performance and perform comparably to existing state-of-the-art methods, without requiring any warm-up or offline training. Our results establish pre-determined forking as a promising research direction for efficient LLM reasoning.
[NLP-33] am MKC at CLPsych 2026: Capturing and Characterizing Mental Health Changes through Social Media Timeline Dynamics
【速读】: 该论文旨在解决心理健康领域中缺乏高效、可扩展的计算方法以实现心理状态的早期检测与持续监测的问题。随着全球范围内精神健康障碍患病率的上升以及专业心理医疗服务的可及性有限,亟需一种能够处理用户生成文本序列并动态追踪其心理状态变化的技术方案。其解决方案的关键在于提出了一种基于大语言模型(Large Language Models, LLMs)的统一分析流程,该流程能够同时实现对用户单条文本内容的细粒度评估(post-level assessment)与跨时间维度的心理状态演化建模(user-level temporal modeling),从而在顺序排列的用户发帖数据上完成全面的心理健康分析。
链接: https://arxiv.org/abs/2606.31464
作者: Kyomin Hwang,Hyeonjin Kim,Hyunho Lee,Nojun Kwak
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in Large Language Models (LLMs) have motivated their adoption across a wide range of domains, including Artificial Intelligence (AI) for mental health. Given the growing prevalence of mental health disorders worldwide and the limited accessibility of professional care, there is an increasing demand for scalable computational approaches that can assist in early detection and continuous monitoring of psychological well-being. In this area, ongoing efforts have focused on curating domain-specific datasets and leveraging them to develop LLMs capable of supporting holistic mental health analysis. In line with this direction, we propose an LLM-based pipeline for comprehensive mental health analysis over sequentially ordered user posts, as part of the CLPsych shared task. Our pipeline offers a unified framework that jointly enables post-level assessment and user-level temporal modeling.
[NLP-34] Revising RVL-CDIP: Quantifying Errors and Test-Train Overlap
【速读】: 该论文旨在解决RVL-CDIP数据集中存在的标签错误(label error)与测试集-训练集重叠(test-train overlap)问题,这两类缺陷会误导模型性能评估并影响其泛化能力。其解决方案的关键在于:(1)通过人工与自动化方法识别并修正标签错误,提升数据标注质量;(2)检测并移除测试集与训练集之间的冗余样本,以确保评估的公平性。研究构建了多个经过修正的RVL-CDIP变体,并在这些新版本上进行文档分类性能基准测试。分析表明,原始数据集中存在约12%的标签错误和约35%的测试-训练集重复。修复标签错误后分类准确率显著提升,而移除重复样本则导致准确率下降,表明重复样本可能对模型性能产生“虚假增强”效应。此外,在RVL-CDIP-N这一分布外(out-of-distribution, OOD)基准上的实验进一步验证了纠错数据的重要性——在修正后的数据上训练可显著提升模型的泛化能力,监督模型平均准确率提升8.1个百分点,最高达14个百分点,凸显了高质量数据对模型鲁棒性和泛化性能的关键作用。
链接: https://arxiv.org/abs/2606.31446
作者: Stefan Larson,Attila Nagy,Sam Desai,Cyrus Desai,Nicole C. Lima,Yixin Yuan,Siddharth Betala,Kaushal K. Prajapati,Jamiu T. Suleiman,Sharad Duwal,Kevin Leach
机构: Vanderbilt University (范德堡大学); ML Collective; University of Michigan (密歇根大学); IIT Madras (印度理工学院马德拉斯分校)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: DocEng 2026
Abstract:RVL-CDIP is a popular dataset for benchmarking document classifiers. However, the dataset contains ample amounts of label errors as well as non-trivial amounts of test-train overlap, both of which may impact model performance metrics. In this paper, we address these two problems by (1) finding and fixing label errors, and (2) detecting and addressing test-train overlap. We produce several variations of RVL-CDIP with label error and test-train overlap fixes, and benchmark document classification performance on these new RVL-CDIP variations. Our rigorous analysis of RVL-CDIP finds that the corpus contains 12% label error and approximately 35% test-train duplication. Remediation sees improvements in classification accuracy when errors are removed, but sees decreases in accuracy when duplicates are removed. We additionally evaluate models on RVL-CDIP-N, an out-of-distribution benchmark, finding that training on error-corrected data substantially improves OOD generalization, with supervised models gaining an average of 8.1 percentage points in accuracy and improvements as large as 14 percentage points.
[NLP-35] CDR-Bench: Evaluating Faithful Execution of Compositional Order-Sensitive Data Refinement Recipes
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在执行复杂、顺序敏感的文本数据精炼任务时存在的程序忠实性不足问题。具体而言,现有基准测试或仅关注文本编辑,或将其与代码执行和工具调用混杂,无法有效评估模型对多步骤、依赖顺序的数据精炼流程(data refinement recipes)的准确执行能力。为填补这一空白,研究提出CDR-Bench,一个涵盖四个真实世界数据精炼领域、包含29种不同操作符的综合性基准,共3,462个高质量任务。其解决方案的关键在于构建具有确定性参考输出的评估框架,能够精确衡量模型在原子级、顺序无关及顺序敏感三种情境下的表现。实验结果表明,尽管当前十余种先进LLMs在简单任务中表现尚可,但在组合式和顺序敏感的任务中均出现显著性能下降,且成功率急剧崩溃,揭示出现有模型在保持过程忠实性方面存在根本性缺陷,难以可靠地执行复杂的、依赖步骤顺序的数据精炼操作。
链接: https://arxiv.org/abs/2606.31435
作者: Yuchen Huang,Xiang Li,Zhenqing Ling,Sijia Li,Qianli Shen,Daoyuan Chen,Yi R. Fung,Yaliang Li
机构: HKUST(香港科技大学); NUS(新加坡国立大学); Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 29 pages, 20 figures. Corresponding authors: Daoyuan Chen and Yi R. Fung
Abstract:Data refinement involves executing multi-step recipes over evolving text states, where both composition and execution order of processing operators determine the outcome. While existing benchmarks either isolate text editing or entangle it with code and tool execution, it remains unclear whether LLMs can directly and faithfully execute these compositional, order-sensitive data refinement recipes. To fill this gap, we introduce CDR-Bench, a comprehensive benchmark featuring 3,462 high-quality tasks spanning four real-world data refinement domains and 29 distinct operators. Our benchmark evaluates models across atomic, order-agnostic, and order-sensitive settings, leveraging deterministic reference outputs to enable exact evaluation. Experiments on 10+ state-of-the-art LLMs reveal consistent failure patterns: performance degrades sharply in compositional settings, and order-sensitive recipe success collapses. These findings underline that current LLMs lack the procedural faithfulness required for reliable compositional data refinement.
[NLP-36] Clinically Structured Rank-Gated LoRA for Cross-Benchmark Medical Question Answering
【速读】: 该论文旨在解决医学多选题问答(Medical Multiple-Choice Question Answering, MCQA)中跨异构知识领域与推理操作的参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)难题,尤其关注不同任务类型(如药物相关、诊断决策、公共卫生、护理操作)对低秩更新(Low-Rank Update)需求的差异性,以及部分需保留基础模型表征的回忆类题目对适配器干预强度的敏感性。其解决方案的关键在于提出一种单适配器秩门控低秩方法——BiRG-LoRA,该方法在每个目标层仅维持一个LoRA模块,但通过输入条件化的秩维度选择机制,利用双轴门控(biaxial gate)融合隐藏语义证据、专业/职业先验及临床操作先验,及其交互信息,动态筛选出稀疏的 top-k 秩原子子集,并引入标量注入系数调控适配器更新强度,实现对不同任务类型的自适应响应。实验结果表明,在匹配的 Qwen3-8B CMB 源协议下,BiRG-LoRA 在涵盖 CMB、CMExam、MedQA 与 MedMCQA 的四个基准上达到 69.31% 的宏平均准确率,优于现有可训练 PEFT 基线与路由控制方法,且相比 MoELoRA 提升 0.89 个百分点,同时减少 28.1% 的可训练参数;置信区间分析显示提升具有统计显著性(95% 置信区间 [0.42, 1.37])。此外,与 vanilla LoRA r16 及主动秩匹配的 LoRA r4 相比,性能分别提升 0.83 个宏平均点,且在评估阶段弱轴扰动测试中表现稳健,验证了其对中等标签噪声的鲁棒性。研究支持一个有限结论:在统一种子协议下,基于临床结构化设计的秩分配策略能有效提升跨基准医学问答性能,而训练种子方差的影响则有待未来工作进一步探索。
链接: https://arxiv.org/abs/2606.31432
作者: Yining Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Medical multiple-choice question answering requires parameter-efficient adaptation across heterogeneous knowledge domains and reasoning operations. A medication question, a diagnostic decision, a public-health item, and a nursing-action item may require different low-rank updates, while some recall items should preserve the base model’s representation with only mild adapter intervention. We propose BiRG-LoRA, a single-adapter rank-gated LoRA method for medical question answering. BiRG-LoRA keeps one LoRA module per target layer but makes its rank dimension input-conditioned: for each question, a biaxial gate combines hidden semantic evidence with specialty/profession priors, clinical-operation priors, and their interaction to select a sparse top- k subset of rank atoms. A scalar injection coefficient further controls the strength of the selected adapter update. Under a matched Qwen3-8B CMB-source protocol, BiRG-LoRA achieves the highest four-benchmark macro-average accuracy among trainable PEFT baselines and matched routing controls: 69.31% averaged over CMB, CMExam, MedQA, and MedMCQA. It improves over MoELoRA by 0.89 percentage points while using 28.1% fewer trainable parameters; a paired, benchmark-stratified bootstrap over final predictions gives a 95% confidence interval of [0.42, 1.37] for this macro-average gain. Basic controls show that BiRG-LoRA also improves over vanilla LoRA r16 and active-rank-matched LoRA r4 by 0.83 macro points, and an evaluation-time weak-axis perturbation check suggests that performance is not brittle to moderate tag noise. The results support a bounded claim: clinically structured rank allocation improves cross-benchmark medical QA under a matched single-seed protocol, while training-seed variance remains future work.
[NLP-37] Linguistic Bias Mitigation for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck
【速读】: 该论文旨在解决生成式语音技术快速发展背景下,语音生物识别系统在对抗语音欺骗攻击时的鲁棒性问题,尤其关注现有欺骗检测模型在跨数据集场景下泛化能力差的挑战。其核心问题是:当前检测器在训练数据分布内表现良好,但在面对未见的、分布外的数据(out-of-domain)时性能显著下降,这主要源于模型对训练数据中语言特征(linguistic cues)的过度依赖,即存在语言偏差(linguistic bias)。为解决这一问题,作者提出一种语言不变的欺骗检测框架,其关键在于采用教师-学生对抗学习机制:通过一个预先在外部语料上预训练的语言感知教师模型,利用梯度反向传播指导学生检测器最小化对语言信息的依赖;同时引入变分信息瓶颈(Variational Information Bottleneck, VIB)机制,以抑制主成分特征,避免非语言线索的无意丢失。实验结果表明,在九个DF Arena数据集上,该方法相较基线模型可实现最高达36.2%的等错误率(EER)相对降低,显著提升了模型在跨域场景下的泛化性能。
链接: https://arxiv.org/abs/2606.31411
作者: Anh-Tuan Dao,Driss Matrouf,Mickael Rouvier,Nicholas Evans
机构: Avignon Universite (阿维尼翁大学); EURECOM (欧洲电信学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Rapid advancements in generative speech technology have compromised the reliability of voice biometrics. While current spoofing detectors excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. We show that this can be due to linguistic bias. A reliance on linguistic cues observed in training data can then compromise robustness to cross-data. We propose a linguistic-invariant spoofing detection framework utilizing teacher-student adversarial learning. The linguistic-aware teacher model, pre-trained on linguistic content of an external dataset, guides the student detector via gradient reversal to minimize the linguistic information. To prevent the inadvertent removal of non-linguistic cues, we incorporate a Variational Information Bottleneck to enable suppression of principal cues. Across nine DF Arena datasets, our method achieves up to a 36.2% relative reduction in the EER compare to the baseline.
[NLP-38] Visual Semantic Entropy: Do Vision Language Models Recognize Visual Ambiguity? ECCV2026
【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在面对视觉模糊输入时产生过度自信预测的问题,尤其关注现有不确定性估计方法在高置信度视觉表征下低估不确定性的缺陷。现有基于熵的方法(如语义熵,Semantic Entropy, SE)依赖输出多样性来衡量不确定性,但研究发现,在随机解码过程中,过自信的视觉嵌表示会抑制输出多样性,导致SE低估实际不确定性。尽管近期方法通过引入文本改写或图文联合扰动来探测输出多样性,但分析表明其产生的变异主要由文本变化驱动,而非视觉证据,使得不确定性估计反映的是提示敏感性而非真实的视觉模糊性。为此,论文提出仅对图像进行扰动、保持文本查询固定的视觉语义熵(Visual Semantic Entropy, VSE)方法,通过聚类生成答案为语义原型并计算其质量加权离散度来量化不确定性。大量实验在五种先进VLM和五个多样化视觉问答基准上验证了VSE能有效捕捉视觉模糊性,显著优于现有方法,确立了当前视觉语言模型不确定性估计的新基准。
链接: https://arxiv.org/abs/2606.31407
作者: Ta Duc Huy,Trang Nguyen,Townim Chowdhury,Ankit Yadav,Minh-Son To,Zhibin Liao,Johan W. Verjans,Vu Minh Hieu Phan
机构: Google(谷歌); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted at ECCV2026
Abstract:Vision-language models can produce confident answers on visually ambiguous inputs, resulting in biased predictions. Common entropy-based methods, such as Semantic Entropy (SE), rely on output diversity. Yet our analysis shows that overconfident visual embeddings suppress output diversity under stochastic decoding, causing SE to underestimate uncertainty in such cases. Recent methods instead probe output diversity through input perturbations, including textual paraphrasing or joint text-image perturbations, and show improved performance. We study these approaches and reveals that the resulting variability is often dominated by textual changes rather than visual evidence, causing uncertainty estimates to reflect prompt sensitivity rather than visual ambiguity. We therefore propose Visual Semantic Entropy (VSE), which perturbs only the image to probe nearby visual variations while keeping the text query fixed. VSE measures uncertainty by clustering generated answers into semantic prototypes and computing the mass-weighted dispersion among them. Extensive evaluation across five modern vision-language models and five diverse VQA benchmarks demonstrates that VSE effectively captures visual ambiguity, establishing a new state-of-the-art for VLM uncertainty estimation.
[NLP-39] Calibrating the Evaluator: Does Probability Calibration Mitigate Preference Coupling in LLM Agent Feedback Loops?
【速读】: 该论文旨在解决大语言模型(LLM)代理在基于评估者反馈进行行为自适应时,因评估者系统性偏见导致其策略分布被错误引导的问题,即评估者偏好耦合(evaluator preference coupling)。该现象表现为评估者的主观偏见通过反馈信号被固化至代理的学习策略中,从而影响决策质量与公平性。论文提出的关键解决方案是:对评估者的成对判断结果进行概率校准(probability calibration),以削弱虚假偏好信号的传播。通过在受控的组内实验(N=5)中对比标准二元TTRL(胜/负)与置信度校准后的TTRL(概率加权更新),研究发现校准可使耦合系数γ降低20%-49%,并使Jensen-Shannon散度下降45%-67%。对照实验进一步排除了更新不对称性等干扰因素的影响,证实效果源于校准本身。研究结果表明,概率校准是一种轻量级且有效的缓解手段,作者已开源校准后的TTRL协议,并建议将其纳入大模型作为评估者(LLM-as-judge)的部署流程中。
链接: https://arxiv.org/abs/2606.31371
作者: Zewen Liu
机构: Zewen Liu; Qilu Institute of Technology, School of Software Engineering (齐鲁理工学院软件工程学院), Tai’an, Shandong, China
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 2 tables
Abstract:When large language model (LLM) agents adapt their behavior through evaluator feedback, systematic evaluator biases propagate into the agent’s learned strategy distribution - a phenomenon termed evaluator preference coupling. Prior work has documented this coupling and established a diagnostic framework (EPC) to measure it, but has not investigated whether calibration techniques can mitigate the effect. We present the first study of evaluator calibration as mitigation: applying probability calibration to the evaluator’s pairwise judgments to reduce spurious preference propagation. In a controlled within-subjects experiment (N=5) comparing standard binary TTRL (win/loss) with confidence-calibrated TTRL (probability-weighted updates) using DeepSeek-V4-Pro as executor and GLM5.2 as evaluator, we find that calibration reduces the coupling coefficient gamma by 20-49% and Jensen-Shannon divergence by 45-67%. A symmetric-LR control confirms the effect is not due to reduced update asymmetry. We release the calibrated TTRL protocol and recommend it as a lightweight mitigation for LLM-as-judge deployment pipelines.
[NLP-40] BlockPilot: Instance-Adaptive Policy Learning for Diffusion-based Speculative Decoding
【速读】: 该论文旨在解决现有基于扩散的推测解码(diffusion-based speculative decoding)方法中固定推理块大小(inference block size)与统一解码策略假设所带来的性能瓶颈问题。尽管此类方法通过块级扩散实现了多标记并行生成,但其忽略了个别输入样本间最优块大小的差异性,导致整体效率受限。研究发现,最优块大小在不同样本间存在显著变化,但具有明显的局部聚集特性,集中于训练时设定的块大小附近,从而将决策空间简化为低维且结构化的形式。针对此问题,论文提出BlockPilot,一种基于样本自适应的策略机制,通过从预填充阶段的表示中预测最优块大小,将块大小选择建模为轻量级策略学习任务,并设计实例自适应的决策机制,在预填充完成后仅需一次预测即可实现无缝集成。实验表明,该方法具备即插即用性、极低开销,并持续提升推理效率,在Qwen3-4B模型上于温度T=1条件下实现了5.92的接受长度和4.20倍的加速比。
链接: https://arxiv.org/abs/2606.31315
作者: Hao Zhang,Yiming Hu,Yong Wang,Mingqiao Mo,Xin Xiao,Xiangxiang Chu
机构: AMAP, Alibaba Group (AMAP,阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: 16 pages
Abstract:Speculative decoding accelerates inference by using a lightweight draft model to generate candidate tokens in parallel, and are then verified by the target model, enabling lossless acceleration. Recently, diffusion-based speculative decoding further improves parallelism by generating multiple tokens per forward pass via block-level diffusion, achieving state-of-the-art (SOTA) performance. However, existing methods adopt a fixed inference block size and assume a uniform optimal decoding strategy across all inputs. In this paper, we show that this assumption is suboptimal, as the optimal block size varies across samples and plays a critical role in speculative decoding performance. Moreover, these values exhibit a clear local structure, concentrating around the training block size, which reduces the problem to a low-dimensional and structured decision space. Based on these insights, we propose BlockPilot, a sample-adaptive policy that predicts the optimal block size from the prefilling representation. Specifically, we formulate block size selection as a lightweight policy learning problem and propose an instance-adaptive decision mechanism that predicts the optimal block size based on the representation of the prefilling stage. The prediction is performed only once after prefilling, allowing for seamless integration. Extensive experiments demonstrate that our method is plug-and-play, introduces minimal overhead, and consistently improves efficiency, achieving an acceptance length of 5.92 and a 4.20 \times speedup on Qwen3-4B under temperature T=1 .
[NLP-41] LOPA: Enhancing Spoken Language Assessment via Latent Ordinal Prototype Alignment
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在口语语言评估(Spoken Language Assessment, SLA)中忽视语言习得内在等级结构(ordinal structure)的问题。现有方法依赖大规模模型参数与基于大语言模型(LLM)的微调,但往往缺乏对语言能力发展有序性的建模。为此,论文提出一种基于原型的正则化方法——潜在等级原型对齐(Latent Ordinal Prototype Alignment, LOPA),通过在隐空间中直接施加等级几何先验,显式建模语言能力的递进性。结合语义锚定层路由(Semantic-Anchored Layer Routing, SALR),该框架可自适应地从冻结的Whisper编码器中提取多层级表征,无需进行大语言模型微调即可实现0.361的均方根误差(RMSE),性能媲美千亿参数系统。关键创新在于LOPA与SALR的协同机制:前者确保隐空间具备可解释的等级结构,后者实现高效、分层的特征选择,从而构建出一种不依赖模型规模扩张、兼具效率与等级感知能力的新型口语评估范式。
链接: https://arxiv.org/abs/2606.31310
作者: Hong-Yun Lin,Fu-An Chao,Bi-Cheng Yan,Berlin Chen
机构: National Taiwan Normal University (国立台湾师范大学)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注:
Abstract:Fueled by increasing model scale and multimodal inputs, Multimodal Large Language Models (MLLMs) have emerged as a promising paradigm for Spoken Language Assessment (SLA). While effective, this paradigm often overlooks the intrinsic ordinal structure of language acquisition. This paper works around the necessity of large-scale MLLMs by introducing Latent Ordinal Prototype Alignment (LOPA) for SLA, a prototype-based regularizer that enforces an ordinal geometric prior directly on the latent space. Coupled with Semantic-Anchored Layer Routing (SALR), which adaptively harvests multi-depth representations from a frozen Whisper encoder, our framework achieves an RMSE of 0.361. This performance rivals billion-parameter systems without the need for LLM-based fine-tuning. Further analysis reveals that SALR’s synergy with LOPA offers interpretable, criterion-aligned preferences, thereby supporting an efficient and ordinal-aware modeling alternative to current scaling-centric models for SLA.
[NLP-42] When the Database Fails: Prompting LLM Dialogue Agents for Safe Recovery in Task-Oriented Dialogue SIGDIAL2026
【速读】: 该论文旨在解决任务导向型对话系统中大语言模型在后端数据库调用失败、返回空结果或出现信息不匹配时,生成看似流畅但不安全的幻觉响应(hallucination)问题,例如虚构场所、确认信息或预订详情。其解决方案的关键在于提出一种轻量级的基于提示(prompting-based)的恢复机制——Guided-Retry策略,该策略通过结构化数据库状态信息对生成过程进行引导,从而在无需模型微调或额外推理调用的前提下提升系统的鲁棒性。实验表明,该方法在多域任务数据集MultiWOZ 2.2和SGD上分别将幻觉率降低50%和42%,且在六种开源模型家族(DeepSeek-R1、Gemma-2、Llama-3、Mistral、Phi-3、Qwen-2.5)及四种数据库异常场景(空结果、错误领域检索、API错误、正常检索)下均表现出一致有效性,验证了其普适性和可靠性。
链接: https://arxiv.org/abs/2606.31307
作者: Mohammad Alijanpour Shalmani,Alale Rezvani Boroujeni,Jiann Shiun Yuan
机构: University of Central Florida(中佛罗里达大学); Google DeepMind(谷歌深脑); Meta(元); Microsoft(微软); Qwen Team(通义千问团队)
类目: Computation and Language (cs.CL)
备注: Accepted at SIGDIAL 2026
Abstract:Large language models used in task-oriented dialogue often produce fluent but unsafe responses when backend database calls fail, return empty results, or surface mismatched information, inventing venues, confirmations, or booking details not grounded in the database. We study a lightweight prompting-based recovery approach that improves robustness without retraining or additional model calls. We compare three response strategies, including a guided recovery prompt conditioned on structured database status, across six open-weight model families (DeepSeek-R1, Gemma-2, Llama-3, Mistral, Phi-3, and Qwen-2.5) and four database conditions: empty result, wrong-domain retrieval, API error, and clean retrieval. Using fault-injected benchmarks built on two structurally different datasets, MultiWOZ 2.2 (5 domains) and SGD (20 domains), we find that naive agents hallucinate on 30.5% of failure turns on MultiWOZ and 20.9% on SGD. Our Guided-Retry strategy reduces hallucination by 50% on MultiWOZ (30.5 to 15.3%) and by 42% on SGD (20.9 to 12.2%) without retraining. However, residual hallucination remains substantial (6-37% across models), with wrong-domain failures the hardest case. Results are consistent across both datasets and all six model families, and human annotation shows substantial agreement while supporting the validity of the automatic commitment-safety metric.
[NLP-43] he Decomposition Is the Fingerprint: Per-Component Identity for Agent Skills
【速读】: 该论文旨在解决生成式AI(Generative AI)代理在运行时动态获取并执行技能时面临的技能身份识别难题。现有基于密码学哈希的方法因对微小变更极度敏感(如单字符修改即导致哈希值完全改变),无法有效捕捉技能间的语义相似性,因而难以支持对技能变体的稳定识别。其核心解决方案是提出一种紧凑、局部敏感的指纹机制,通过多银行SimHash将技能的三个核心组件——提示(prompt)、代码(code)和工具声明(tools)——分别嵌入并投影为固定120字节的签名,利用汉明距离实现常数时间比对。该方案的关键创新在于保持指纹为组件级三元组(prompt, code, tools)而非单一数值,从而能够通过共享组件识别出技能家族关系,支持通过改写、重命名、重构及受控代码转换等操作产生的变体,并能精准定位发生变更的具体组件;而独立的多语言重实现则不会被误判为同一技能。该指纹不承诺行为等价性,而是提供结构化的身份线索,作为技能注册表中的谱系标识,与行为验证互补,而非替代安全判断。实验表明,该指纹在4,950对比较中达到0.974的AUC(95%置信区间[0.956, 0.994]),仅使用其近似嵌入所需比特数的1/77,同时保持排名顺序的期望一致性和有限比特浓度特性,实现了关系分类、家族识别、新颖性检测以及可移植的“技能物料清单”(SkillBOM)功能。在包含906个技能的注入基准测试中,该指纹可准确识别注入技能为已知基线的篡改副本并精确定位变更点,但其本身仅为身份信号,不构成信任背书。
链接: https://arxiv.org/abs/2606.31272
作者: Hongliang Liu,Yuhao Wu,Tung-Ling Li
机构: Palo Alto Networks (帕洛阿尔托网络)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:AI agents increasingly acquire and execute skills at runtime: bundles of prompt instructions, executable code, and tool declarations fetched from marketplaces and other agents. Governing them needs a stable notion of skill identity, yet cryptographic hashing is engineered to destroy the very similarity we need, as a one-character edit scrambles the digest. We present a compact, locality-sensitive fingerprint that embeds each component of a skill and projects it to bits with a multi-bank SimHash, giving a fixed 120-byte signature compared in constant time by Hamming distance. Our central claim is that keeping the fingerprint as a per-component triple (prompt, code, tools), rather than a single score, is what makes it useful: the triple recovers skill-family identity through paraphrase, renaming, refactoring, and controlled code translation when another component remains shared, while independent multilingual reimplementation is not recovered; it also localizes which component carries the reuse. We claim lineage, not behavioral equivalence: identity supplies the structural axis of a registry and leaves safety to behavioral verification. The fingerprint reaches an area under the ROC curve (AUC) of 0.974 (95% CI [0.956, 0.994]) over 4,950 pairwise comparisons while using 77x fewer bits than the embedding it approximates, with ranking preserved in expectation and finite-bit concentration; the per-component split turns one number into relationship classification, families, novelty, and a portable “SkillBOM” for a skill registry. On a 906-skill injection benchmark the fingerprint recognizes injected skills as tampered copies of a known base and localizes the change, but recognition is not trust: it remains, by design, an identity signal complementary to behavioral verification rather than a safety verdict.
[NLP-44] Learning from Failure: Inference-Time Self-Improvement for Computer-Use Agents ECCV2026
【速读】: 该论文旨在解决计算机操作代理(Computer-use agents)在训练过程中依赖大规模高质量轨迹数据的问题,尤其针对现有基于成功轨迹自改进循环的范式仅利用成功案例而忽略失败轨迹所蕴含的丰富信息这一局限。其核心解决方案是提出一种以失败驱动的自改进机制(failure-driven self-improvement loop),通过引入大语言模型(LLM)对失败轨迹进行故障诊断、生成推理时修复方案并构造代码补丁,经轻度人工验证后用于升级代理。该方法无需额外训练成本,仅带来适度的推理开销,在OSWorld基准上使当前最先进的OpenCUA-72B模型的成功率从42.3%提升至48.9%,显著提升了6.6个百分点,验证了失败驱动策略作为成功导向流程的有效补充,可实现更高效的智能体迭代优化。
链接: https://arxiv.org/abs/2606.31270
作者: Xueqiao Sun,Xiaohan Wang,Ludwig Schmidt,Serena Yeung-Levy,Yuhui Zhang
机构: Stanford University (斯坦福大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Published in ECCV 2026
Abstract:Computer-use agents, which leverage multimodal large language models (MLLMs) to operate computers and complete tasks, have attracted significant attention for their utility and versatility. A major challenge in developing these agents is collecting large-scale, high-quality trajectories. The standard approach generates synthetic data through a self-improving loop: an agent is placed in a verifiable environment and iteratively fine-tuned on its successful trajectories. Despite its effectiveness, this paradigm exploits only successful trajectories and discards the failed ones, even though failures carry rich information about a model’s weaknesses. In this work, we explore a complementary failure-driven self-improvement loop, a data-centric paradigm that turns failed trajectories into agent improvements. Specifically, we employ an LLM to diagnose failure modes, propose inference-time solutions, and generate code patches – lightly verified by humans – that upgrade the agent. We validate this approach with the state-of-the-art OpenCUA-72B model on the OSWorld benchmark, improving the success rate from 42.3% to 48.9%, a gain of 6.6 percentage points, without any additional training cost and with only modest inference overhead. Our results demonstrate that failure-driven self-improvement is a viable complement to success-based pipelines, enabling more efficient agent improvement.
[NLP-45] Probing Stylistic Appropriation using Large Language Models : An Evaluation Framework for Copyright Infringement under EU Law
【速读】: 该论文旨在解决当前生成式人工智能(Generative AI)在训练过程中可能引发的版权侵权问题,尤其针对现有技术防护措施仅关注字面复制(verbatim memorisation)而忽视欧盟版权法所保护的“实质性相似性”(substantial similarity)这一关键法律标准之间的不匹配问题。其核心解决方案是提出PSALM框架——一个将欧盟版权法原则量化的“大语言模型作为裁判者”(LLM-as-a-judge)评估体系,通过十类评估维度系统化检测计算重叠、写作风格、叙事语调、角色设定、情节结构、场景构建等抽象创作要素,并纳入法定例外情形(如戏仿、拼贴、引用及固定场景)。实验结果表明,指令微调模型在未接触语料前即存在显著的风格相似性,微调过程进一步导致在多个与侵权相关维度上的系统性风格剽窃,且即便采用负向偏好优化(Negative Preference Optimisation)进行去学习,仍残留可检测的风格模式。这揭示了仅防范字面抄袭的技术手段无法有效应对更广泛的版权风险。PSALM为实现可审计、法律导向的合规评估提供了技术基础设施,但其自动化相似度评分与法律侵权判定之间的关联仍需法律专家验证。该研究成功实现了定性法律标准与定量技术测量的融合,暴露了生成式AI与欧盟知识产权法之间深层的制度张力。
链接: https://arxiv.org/abs/2606.31250
作者: Noah Scharrenberg,Chang Sun
机构: Maastricht University (马斯特里赫特大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLM) trained on web-scale corpora generate output that may infringe copyright, yet existing technical safeguards focus narrowly on verbatim memorisation. EU copyright doctrine applies a broader standards: substantial similarity, which extends to stylistic choices, narrative structure, and creative elaboration. This mismatch between what current methods detect and what the law protects leaves a significant compliance gap. We introduce PSALM, an LLM-as-a-judge framework that operationalises EU copyright doctrine through ten evaluators assessing computational overlap, stylistic dimensions (writing style, narrative voice), content dimensions (character, plot, scene, world building), and statutory exceptions (parody, pastiche, quotation, scènes à faire). Applying PSALM to Llama~3.2 models fine-tuned on translated historical Dutch literary works, we find that: 1) instruction-tuned models exhibit non-trivial baseline stylistic similarity prior to corpus exposure; 2) fine-tuning induces systematic stylistic appropriation across all infringement-relevant dimensions, extending beyond verbatim memorisation to abstract narrative patterns; 3) Negative Preference Optimisation unlearning substantially reduces similarity but leaves detectable residual stylistic patterns. These findings indicate that safeguards targeting literal copying alone are insufficient to mitigate broader copyright risks. PSALM provides infrastructure for auditable, legally informed compliance evaluation, though the relationship between automated similarity scores and infringement determinations requires validation by legal experts. This work bridges qualitative legal standards and quantitative technical measurement, exposing fundamental tensions between generative AI and EU intellectual property law.
[NLP-46] Can LLM s Imagine Moral Alternatives Beyond Binary Dilemmas?
【速读】: 该论文旨在解决大语言模型(LLM)在作为道德顾问或代理时,面对价值冲突类道德困境(moral dilemma)时缺乏超越给定选项的创造性思维能力的问题。现有研究忽视了人类道德认知中一个核心特征:能够构想出超越原有选项的替代方案。为此,论文提出MoralAltDataset数据集,包含307个涵盖叙事型顾问困境与面向AI的代理困境的道德难题,并为每个困境补充了妥协型和重构型替代方案。研究发现,在引入这些替代方案后,15种不同大语言模型均表现出显著的判断偏移,多数更倾向于采纳妥协性替代方案,从而实质性地重塑了道德选择路径。进一步通过成对偏好评估与专家评判标准对比分析表明,大语言模型生成的替代方案在结构合理性与伦理契合度方面常优于人类撰写的版本,但同时揭示出结构质量与实际可行性之间的权衡关系。因此,解决方案的关键在于构建并利用可生成高质量、具有创造性的替代道德方案的能力,使大语言模型具备类人式的超越性道德推理能力。
链接: https://arxiv.org/abs/2606.31213
作者: Jongchan Choi,Nari Yang,Sung Soo Park,Jaemin Cho,Han Seoyoung,Haerin Shin,Jun-Hyung Park
机构: Korea University (韩国大学); XenoStep AI (XenoStep AI); Hankuk University of Foreign Studies (韩国外国语大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: "23 pages. Preprint
Abstract:As large language models (LLMs) are increasingly deployed as moral advisors and agents, they need to address dilemmas between two competing values. However, existing research on LLMs with moral dilemmas overlooks a central aspect of human moral cognition: the ability to imagine alternatives that move beyond the given options. We introduce MoralAltDataset, a dataset of 307 moral dilemmas spanning narrative Advisor dilemmas and AI-facing Agent dilemmas, each augmented with compromise and reframed alternatives. We first examine whether humans and LLMs shift their judgments when such alternatives are introduced. Across 15 LLMs, we find that compromise alternatives are often preferred over either original option, substantially reshaping moral choice. We then evaluate the quality of LLM-generated alternatives against human-authored ones using pairwise preference and expert-based criteria. Results show that LLM-generated alternatives are often preferred and better satisfy fine-grained structural and ethical criteria, while revealing trade-offs between structural quality and practical feasibility.
[NLP-47] Gated Multi-Graph Fusion via Graph Attention Networks for Alzheimers Disease Detection INTERSPEECH2026
【速读】: 该论文旨在解决阿尔茨海默病(Alzheimer’s Disease, AD)早期诊断中自发语言作为非侵入性生物标志物时,现有系统普遍忽视语言结构的非线性破坏及临床表型异质性的问题。其核心挑战在于如何有效建模语言内容、句法结构与语义流之间的复杂关系,并适应不同患者群体在语言表现上的显著差异。解决方案的关键在于提出一种多视角门控图注意力网络(Multi-View Gated Graph Attention Network),通过自动语音识别(ASR)将语音转录为文本后,构建语义图、依存关系图与共现图,基于“内容-结构-流”框架全面表征语言特征。其中,共现图利用来自正常人群语料库的点互信息(Pointwise Mutual Information, PMI)量化叙事逻辑连贯性与语言偏离程度,从而捕捉病理语言中的隐含模式。为应对临床异质性,模型引入自适应门控融合机制,动态加权不同图视角的信息,实现对多样化患者群体的鲁棒分类。在ADReSSo数据集上的实验表明,该方法达到90.00%的分类准确率,消融实验进一步验证了基于PMI的共现图与异质性感知门控机制对于提升模型泛化能力的关键作用。
链接: https://arxiv.org/abs/2606.31186
作者: Jinyu Li,Xiao Wei,Bin Wen,Kai Li,Yuqin Lin,Xiaobao Wang,Longbiao Wang,Jianwu Dang
机构: Tianjin University (天津大学); Chinese Academy of Sciences (中国科学院); Fuzhou University (福州大学); Huiyan Technology (Tianjin) Co., Ltd (天津慧言科技有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 1 figure, 2 tables, and accepted in interspeech 2026 conference
Abstract:Spontaneous speech is a vital non-invasive biomarker for Alzheimer’s Disease (AD), yet many systems overlook non-linear structural disruptions and clinical heterogeneity in pathological language. We propose a Multi-View Gated Graph Attention Network that transcribes audio via Automatic Speech Recognition (ASR) to construct semantic, dependency, and co-occurrence graphs, characterizing speech through a “content-structure-flow” framework. Notably, the co-occurrence graph leverages Pointwise Mutual Information (PMI) from a normative corpus to quantify narrative logic and linguistic deviation. To address symptomatic diversity, an adaptive gated fusion mechanism dynamically integrates these views. Evaluated on the ADReSSo dataset, our model achieves 90.00% accuracy. Ablation results confirm that the PMI-based graph and heterogeneity-aware gating are essential for robust classification across diverse clinical populations. Our source code is publicly available at this https URL.
[NLP-48] HealthAgent Bench: A Unified Benchmark Suite of Realistic Agent ic Healthcare Environments for Challenging Frontier AI Agents
【速读】: 该论文旨在解决当前生成式AI(Generative AI)在医疗健康领域应用中缺乏系统性、全面评估框架的问题,尤其针对具备复杂长时程推理能力的AI代理(AI agents)在真实临床场景下的性能评估。现有评估方法难以充分反映代理在多模态数据处理、端到端临床工作流执行及跨步骤决策等方面的综合能力。为此,论文提出HealthAgentBench,一个涵盖54项智能体医疗任务的基准测试套件,覆盖7个不同类别,每个类别具有独特的环境设定,模拟患者全周期诊疗流程中的多样化实际任务。其关键解决方案在于构建高度仿真的端到端临床工作流任务:要求代理在仅接收少量指令的前提下,自主探索原始医疗数据(如电子健康记录、医学影像等),在复杂环境中进行多步推理与操作,并完成超出简单提示(naive prompting)范畴的复合型任务。通过报告整体任务成功率作为单一可解释指标,实现对各代理性能的统一量化评估。实验表明,即便是在最先进的模型中,整体成功率仍较低(最强且成本效益最高的Codex GPT-5.5仅达约42%),暴露出当前模型在医学影像理解、大规模搜索空间与组合推理相结合的任务上存在显著瓶颈。该基准不仅揭示了模型在特定任务类别的优劣势,更凸显出生成式AI在真实医疗应用中仍面临巨大挑战,为未来研究提供了明确方向。
链接: https://arxiv.org/abs/2606.31179
作者: Qianchu Liu,Sheng Zhang,Guanghui Qin,Jeya Maria Jose Valanarasu,Maximilian Rokuss,Mingyu Lu,Timothy Ossowski,Juan Manuel Zambrano Chaves,Cliff Wong,Peniel Argaw,Yashna Hasija,Mu Wei,Wen-wai Yim,Qin Liu,Zilin Jing,Jason Entenmann,Naoto Usuyama,Tristan Naumann,Hoifung Poon
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As AI agents become increasingly capable of complex, long-horizon reasoning, rigorous and holistic evaluation is essential for measuring progress toward real-world healthcare applications. We introduce HealthAgentBench, a suite of 54 agentic healthcare tasks across 7 categories each with its unique environment. The benchmark suite spans diverse workflows throughout the patient journey and a broad range of modalities. Each task is designed to replicate an end-to-end clinical workflow: given minimal instructions, an agent must explore raw healthcare data, operate within a complex environment, and execute multi-step solutions that go beyond naive prompting. A final task success rate is reported to provide a single, interpretable metric for HealthAgentBench overall performance for each agent. Evaluating frontier agents on HealthAgentBench, we find that overall task success rate remains low, underscoring the difficulty of the suite. The strongest and the most cost effective agent, Codex GPT-5.5, achieves only approximately 42% success rate. Beyond aggregate performance, HealthAgentBench reveals nuanced strengths and weaknesses across task categories. Frontier agents show promise in automatically developing research modeling pipelines over EHR data, but medical imaging remains especially challenging, particularly for Claude Code models, while Codex GPT-5.5 shows emerging capability. Tasks that combine large search spaces with compositional reasoning requirements remain difficult for all current agents. Together, these results suggest that HealthAgentBench provides a challenging and realistic benchmark with substantial room for future progress. We release our benchmark at this https URL.
[NLP-49] AG-DLM: Diffusion Language Models for Text-Attributed Graph Learning
【速读】: 该论文旨在解决文本属性图(Text-attributed Graphs, TAGs)中多模态联合推理问题,即如何有效融合节点的自然语言描述与图拓扑结构进行统一建模。现有方法通常将文本与图结构处理分离:图神经网络(Graph Neural Networks, GNNs)仅使用浅层文本特征,而基于大语言模型(Large Language Models, LLMs)与图的混合方法则主要将LLM用作文本编码器,图结构学习仍依赖独立的图模块,导致模态间信息交互不充分。本文提出一种基于掩码扩散语言模型(masked diffusion language model)的统一框架,该模型兼具双向注意力机制与生成式解码能力,能够同时实现文本理解与图消息传递。其核心创新在于将采样的局部邻域线性化为词元序列,并通过拓扑注意力掩码注入图结构信息,从而在序列建模过程中实现图上的消息传递。由于该模型具备文本生成与解释能力,可通过调整提示(prompt)适配不同任务,无需针对特定任务进行微调,支持节点分类、链接预测及跨数据集迁移。实验表明,该方法在三个TAG基准上均显著优于GNN、图变压器(Graph Transformers)及现有基于LLM的基线模型,性能提升最高达3.9个点。
链接: https://arxiv.org/abs/2606.31166
作者: Lingjie Chen,Yuanchen Bei,Haobo Xu,Yanjun Zhao,Yuzhong Chen,Hanghang Tong
机构: UIUC; VISA
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Text-attributed graphs (TAGs), where each node carries a natural language description, require models to jointly reason over text and graph topology. Existing approaches often handle the two modalities separately: graph neural networks operate on shallow text features, while hybrids of LLMs and graphs use the language model mainly as a text encoder and delegate structure learning to a separate graph module. We propose method that unifies textual reasoning and graph message passing within a masked diffusion language model, a language model with bidirectional attention and generative decoding. For each graph instance, method linearises a sampled local neighbourhood into a token sequence and injects graph structure through a topology attention mask, which realises message passing over the graph. Because the diffusion language model can both interpret and generate text, the method adapts to different tasks simply by changing the prompt, supporting node classification, link prediction, and cross-dataset transfer with no target-specific fine-tuning. Experiments show that method outperforms graph neural networks, graph transformers, and LLM-based baselines on all three TAG benchmarks across two tasks, improving over the strongest baseline by up to 3.9 points. Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG) Cite as: arXiv:2606.31166 [cs.CL] (or arXiv:2606.31166v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2606.31166 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[NLP-50] ComplianceGate: Classifier-Gated Multi-Tier LLM Routing for Inference in Regulated Industries
【速读】: 该论文旨在解决在受监管行业中部署大语言模型(Large Language Models, LLMs)时面临的双重挑战:合规性强制执行与成本效率优化。现有架构存在数据隐私风险与资源浪费问题,例如用户查询中的个人身份信息(PII)可能在系统判断其是否应跨越管辖边界前就已抵达模型端点;单一大规模模型处理所有查询导致GPU资源被完全占用,无论查询复杂度如何,且缺乏地理路由机制;而混合专家(Mixture-of-Experts, MoE)架构虽具备一定程度的负载分配能力,但路由发生在数据到达端点后、专家层内部,所有专家均需常驻内存,无法实现按需调度。本文提出一种基于分类器门控的路由架构,通过在解码器推理前引入一个经过训练的编码器-分类器,对每条查询进行复杂度与数据敏感性评估,并将其动态路由至相应规模和地理位置的密集模型。该设计确保含PII的查询在任何大模型计算开始前即被导向本地端点,从根本上杜绝数据跨境违规的可能性;简单查询则被高效分配至小型快速模型,显著降低计算开销。实验表明,该方案在600个查询上的评估中实现了39%的中位延迟降低、33%-52%的成本节约(取决于查询分布),生成吞吐量达122-200 tokens/second,优于基线的50-64 tokens/second;编码器分类器达到99.2%的准确率,PII召回率近乎完美,仅带来7ms的推理开销,验证了预推理分类作为“合规性即设计”(compliance-by-design)LLM部署可行路径的有效性。
链接: https://arxiv.org/abs/2606.31163
作者: Abhishek Dey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large language models deployed in regulated industries operate under two constraints: compliance enforcement and cost efficiency. Personally identifiable information (PII) in user queries can reach model endpoints before the system determines whether that data should leave its jurisdictional boundary. Serving all queries through a single large model consumes full GPU capacity regardless of query complexity while offering no mechanism for geographic routing. Mixture-of-Experts architectures do not address this routing occurs between expert layers within the model after data has already arrived at the endpoint, with all experts loaded in memory regardless of query complexity. We propose a classifier-gated routing architecture that enforces compliance by design. A trained encoder classifier sits before any decoder inference, evaluating each query for complexity and data sensitivity, then routing it to an appropriately sized dense model in the appropriate geographic location. PII-containing queries route to local endpoints before any LLM computation begins, making data residency violations structurally impossible. Simple queries reach small, fast models at a fraction of the cost. Our evaluation on 600 queries demonstrates 39% median latency reduction, 33-52% cost savings depending on query distribution, and generation throughput of 122-200 tokens/second versus 50-64 for the baseline. The encoder classifier achieves 99.2% accuracy with near-perfect PII recall at 7ms inference overhead, establishing pre-inference classification as a practical path to compliance-by-design LLM deployment.
[NLP-51] PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding
【速读】: 该论文旨在解决3D视觉定位(3D Visual Grounding, 3DVG)任务中因在全场景范围内进行推理而导致的预测模糊性与高计算开销问题,尤其在复杂杂乱环境中表现尤为显著。其核心挑战在于如何高效精准地从大规模点云场景中定位由自然语言描述所指代的目标物体。解决方案的关键在于提出一种名为PruneGround的即插即用框架,其创新性体现在三个核心组件:首先,引入语言引导的空间剪枝(Language-Guided Spatial Pruning, LGSP),利用冻结的视觉语言模型(Vision Language Model, VLM)识别与语言描述相关的局部空间区域,从而显著缩小搜索空间并降低计算负担;其次,提出多视角条件下的描述重构(MultiView-Conditioned Description Reformulation, MCDR),通过将复杂表达式分解为简化的目标-锚点关系,并借助多视角推理补全缺失的空间线索,提升语义理解精度;最后,设计基于大语言模型(LLM-Grounder)的定位器,通过在剪枝后的区域内对点云与语言表示进行对齐,将预训练的目标检测型空间大语言模型转化为语言条件下的定位模型。实验结果表明,该方法在三大主流点云基准上均达到领先性能,在所有ScanRefer设置及9/10个Nr3D/Sr3D设置中表现最优,验证了其有效性与泛化能力。
链接: https://arxiv.org/abs/2606.31148
作者: Duc Cao Dinh,Khai Le-Duc,Florent Draye,Chris Ngo,Terry Jingchen Zhang,Bernhard Schölkopf,Zhijing Jin
机构: Knovel Engineering Lab(Knovel工程实验室); University of Toronto(多伦多大学); Vector Institute(向量研究所); ELLIS Institute(ELLIS研究所); Max Planck Institute(马克斯普朗克研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Preprint
Abstract:3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions. Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high computational cost, especially in cluttered environments. We observe that many referential expressions rely on local spatial context and often correspond to restricted spatial regions rather than the full scene. Motivated by this insight, we propose PruneGround, an effective plug-and-play framework for 3DVG built upon three key components. First, we introduce Language-Guided Spatial Pruning (LGSP), which leverages a frozen Vision Language Model (VLM) to identify language-relevant regions, thereby reducing spatial computation and grounding candidates in the narrower search space. Second, we propose MultiView-Conditioned Description Reformulation (MCDR), which decomposes complex expressions into simplified target-anchor relations and augments missing spatial cues through multi-view reasoning. Finally, we propose LLM-Grounder, which repurposes a detection-pretrained spatial LLM into a language-conditioned grounding model by aligning point cloud and linguistic representations within the pruned region. Extensive experiments on the three most popular point cloud benchmarks demonstrate that our method achieves state-of-the-art results on all three ScanRefer settings and on 9 out of 10 Nr3D/Sr3D settings. Code and models are publicly available: this https URL
[NLP-52] SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference
【速读】: 该论文旨在解决大语言模型在长上下文场景下因键值缓存(KV cache)规模随序列长度线性增长而导致的显存瓶颈问题。现有压缩方法难以兼顾计算效率与上下文信息的忠实保留:传统令牌剔除会丢失细节,而语义分组则在预填充阶段固定压缩决策,无法在生成过程中恢复被压缩片段的细粒度信息。为此,本文提出SeKV——一种自适应分辨率的语义化KV缓存机制,通过熵引导的语义跨度划分,将上下文组织为跨GPU-CPU内存层级存储的结构。每个语义跨度在GPU上保留轻量级摘要向量以实现粗粒度路由,在CPU上存储低秩奇异值分解(SVD)基底以支持按需的令牌级重建。训练好的“放大”机制可选择性地在解码过程中展开与查询相关的跨度,从而实现精准检索而不需将完整KV缓存加载至GPU。该方案在保持原始大语言模型(LLM)完全冻结的前提下,仅引入少于0.05%的可训练参数,实现了令牌级重建的自适应能力。在四个基准测试中,相比最强的语义压缩基线平均提升5.9%,同时在128K上下文长度下相较全量KV缓存减少53.3%的GPU显存占用。
链接: https://arxiv.org/abs/2606.31145
作者: Amirhossein Abaskohi,Giuseppe Carenini,Peter West,Yuhang He
机构: University of British Columbia (不列颠哥伦比亚大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models increasingly operate over long contexts, where the KV cache becomes a dominant memory bottleneck: its size grows linearly with sequence length and must be retained throughout decoding, making full GPU caching prohibitively expensive without compression. Existing KV cache compression methods struggle to balance efficiency with faithful context preservation. Token eviction discards information, while semantic grouping fixes compression decisions at prefill time; neither can recover token-level detail from a compressed span once it becomes relevant during generation. As a solution, we propose SeKV, a resolution-adaptive semantic KV cache that organizes context into entropy-guided semantic spans and stores them across a GPU-CPU memory hierarchy without discarding information. Each span keeps a lightweight summary vector on GPU for coarse routing and a low-rank SVD basis on CPU for on-demand token-level reconstruction. A trained zoom-in mechanism selectively expands query-relevant spans during decoding, enabling precise retrieval without materializing the full KV cache on GPU. SeKV enables adaptive token-level reconstruction while keeping the base LLM fully frozen and adding fewer than 0.05% trainable parameters. Across four benchmarks, SeKV improves over the strongest semantic compression baseline by 5.9% on average while reducing GPU memory by 53.3% versus full KV caching at 128K context. Code is available on this https URL.
[NLP-53] UniSAE: Unified Speech Attribute Editing on Speaker Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling
【速读】: 该论文旨在解决现有语音编辑方法在内容、说话人和情感编辑任务中割裂处理所导致的粒度粗略与灵活性不足的问题。其核心挑战在于如何实现跨层级(从子音素到词级)的统一、可组合的语音属性编辑,同时保持各属性之间的解耦控制。解决方案的关键在于提出UniSAE框架,通过引入离散音素后验图(Discrete Phonetic PosteriorGram, DPPG)表示,将语音内容分解为编码音素身份、发音变体及持续时间的离散标记,从而支持直接的音素级乃至子音素级编辑;在此基础上,采用自回归内容变换器生成经编辑的DPPG序列以实现词级内容修改,并利用基于扩散模型的声学解码器,在解耦的说话人与情感表征条件下重建语音输出,最终实现三类属性在单一架构内的联合、灵活且高精度的可控编辑。
链接: https://arxiv.org/abs/2606.31128
作者: Chuanbo Zhu,Wuyou Zhou,Rongxiu Zhong,Shilei Zhang,Kun Qian,Yike Guo,Wei Xue
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion editing as separate tasks, limiting both editing granularity and flexibility. We propose UniSAE, a unified speech attribute editing framework which supports composable speaker, emotion and content editing from sub-phoneme to word level within a single architecture. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, enabling direct phoneme- and sub-phoneme-level editing. For higher-level modifications, an autoregressive content transformer predicts edited DPPG sequences for word-level content editing. The edited sequences are rendered into speech by a diffusion-based acoustic decoder, conditioned on disentangled speaker and emotion representations. Experimental results demonstrate that the proposed unified framework supports precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes within a single framework.
[NLP-54] When Reranking Hurts: Uncertainty-Based Gating for Few-Shot Reranking
【速读】: 该论文旨在解决少样本选择(few-shot selection)中普遍假设重排序(reranking)始终能提升性能的问题,指出昂贵的重排序步骤实际上可能降低模型表现。其核心解决方案是提出一种无需训练的门控重排序(Training-Free Gated Reranking)方法,该方法基于模型对输入样本的不确定性判断,动态决定是否对少样本示例进行重排序。实验在8个大语言模型(LLM)上覆盖7个自然语言理解(NLU)数据集和9个机器翻译(MT)领域-语言组合中验证了该方法的有效性,结果表明该方法可降低15%–80%的计算开销,同时平均性能提升最高达2%。研究揭示了高计算成本并不必然带来性能增益,且重排序仅在高不确定性实例上最具价值,从而为高效、智能的少样本推理提供了新范式。
链接: https://arxiv.org/abs/2606.31087
作者: Orian Dabod,Amir Cohen,Gabriel Stanovsky
机构: The Hebrew University of Jerusalem; OriginAI / Ramat Gan, Israel
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Few-shot selection typically assumes that reranking retrieved examples always improves performance. We challenge this view by identifying that the expensive reranking step can in fact degrade performance. Instead, we propose \emphTraining-Free Gated Reranking, which decides whether to rerank the few-shot examples based on the model’s uncertainty. Extensive experiments across 8 LLMs, covering 7 NLU datasets and 9 MT domain-language combinations, demonstrate that our approach reduces computational costs by 15%-80% while improving average performance by up to 2%. These findings indicate that higher computational cost does not guarantee better performance, and that reranking is most beneficial when targeted at high-uncertainty instances.
[NLP-55] riospect: A Three-Dimensional Framework for Robust Statistical AI-Generated Text Detection Against Diverse Attacks ACL
【速读】: 该论文旨在解决现有生成式AI(Generative AI)文本检测方法在面对操纵文本特征的攻击时鲁棒性不足的问题。其解决方案的关键在于提出一种新颖的三重视角检测框架(Triospect Detection Framework),该框架通过引入文本内容(核心思想)与表达(风格元素)两个额外分析维度,从多角度综合评估文本的真实性,从而有效识别经过对抗性修改的生成文本。实验结果表明,该框架在涵盖17种攻击、12个领域和17个源模型的两个基准测试中均表现出显著优于强基线模型的性能,尤其在人类化后攻击子集(Humanize-16K after-attack)和对抗性RAID数据集上分别实现了22.3%(AUROC)和13%(TPR01)的提升,验证了其在统计方法层面增强检测可靠性方面的有效性。
链接: https://arxiv.org/abs/2606.31074
作者: Guangsheng Bao,Lihua Rong,Yanbin Zhao,Xiao Yu,Qiji Zhou,Yue Zhang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: TACL final version, 12 pages, 9 figures, and 9 tables
Abstract:Existing AI-generated text detectors are vulnerable to attacks that manipulate textual characteristics. In this study, we propose a novel Triospect Detection Framework by using additional perspectives of content (core ideas) and expression (stylistic elements) within a given text. Experiments on two benchmarks involving 17 attacks, 12 domains, and 17 source models demonstrate that Triospect is robust against these attacks. It improves the strong baseline by a significant margin of 22.3% (AUROC) and 13% (TPR01) on the Humanize-16K after-attack subset, and by 9.1% (AUROC) and 22% (TPR01) on the adversarial RAID. This framework marks a pioneering effort in statistical methods to enhance detection reliability against attacks. We release our data and code at this https URL.
[NLP-56] Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems
【速读】: 该论文旨在解决生成式语音-语音(S2S)AI代理在对话韵律与节奏评估中缺乏可解释的、以语音本体为基准的度量标准的问题。现有方法依赖于汇总的人类统计数据进行评估,但此类统计无法准确反映特定说话者特征和交互状态下的声学属性变化,导致评估结果偏差较大。其解决方案的关键在于构建基于真实人类对话数据(来自Seamless Interaction数据集的4000+小时双人英语对话)的匹配参照体系,针对基频(F₀)均值、F₀表现力、语速、发音率、停顿比率及平均停顿时长等关键韵律指标,建立与输出样本在说话者特征与交互状态上相匹配的参考分层。在此基础上,提出一种基于百分位数的评估协议:从S2S输出波形中提取相同指标,与最匹配的人类参照分层进行比较,并报告百分位偏离值或5th–95th百分位外的异常标志。实验表明,使用匹配参照可显著降低对状态相关F₀表现力与节奏的误报率,使错误标志率更接近名义上的10%,同时使偏差方向具有行为可解释性。该方法作为感知与用户中心评估的补充,提供了可解释的行为合理性检验。
链接: https://arxiv.org/abs/2606.31055
作者: Ashish Hallur,Thomas Thebaud,Georgi Tinchev,Venkatesh Ravichandran,Laureano Moro-Velazquez
机构: Google(谷歌); Meta(元)
类目: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because F_0 , speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for F_0 mean, F_0 expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned F_0 expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.
[NLP-57] ADAPT: Attention Dynamics Alignment with Preference Tuning for Faithful MLLM s ECCV2026
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)中存在的幻觉问题,即模型在生成内容时与输入图像不一致。其核心问题是:在生成过程中,文本到图像的交叉注意力(text-to-image cross-attention)出现渐进式退化,导致注意力聚焦模糊或产生偏差,进而引发系统性幻觉。现有缓解策略多为结果导向,未能针对性干预这一内在机制。为此,作者提出ADAPT(Attention Dynamics Alignment with Preference Tuning)框架,其关键在于从注意力动态行为出发,直接干预交叉注意力过程。具体包括三项创新:基于早期解码阶段优化的视觉锚点(visual anchor),提供稳定的空间定位;注意力监督推理机制,可在线检测并纠正注意力漂移;以及视觉注意力引导的直接偏好优化(Visual Attention Guidance DPO),使模型偏好更符合视觉事实的输出。实验表明,各组件均有效降低幻觉率,完整框架在主流模型上实现40%-60%的幻觉率下降,同时保持了模型的通用多模态能力,为通过分析和调控内部注意力行为来抑制幻觉提供了新的视角。
链接: https://arxiv.org/abs/2606.31054
作者: Zhiyuan Yao,Zheren Fu,Zhixiao Zheng,Jiajun Li,Yi Tu,Zhendong Mao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted by ECCV 2026
Abstract:Multimodal Large Language Models (MLLMs) are critically hampered by hallucination, generating content inconsistent with the provided image. In this paper, we identify an internal signature of hallucination: progressive degradation of text-to-image cross-attention during generation, leading to specific failure patterns like unfocused or biased attention. Existing mitigation strategies are largely outcome-driven and do not explicitly target this failure mode. To address this problem, we propose ADAPT (Attention Dynamics Alignment with Preference Tuning), an attention-based framework that intervenes directly on text-to-image cross-attention dynamics. We propose ADAPT with three key contributions: a cross-attention visual anchor refined from early decoding to provide stable spatial grounding, an attention-supervised inference mechanism that detects and corrects attention drift online, and a Visual Attention Guidance DPO that aligns preferences toward visually grounded responses. Experiments show that each component of ADAPT contributes to hallucination reduction, and the full framework achieves new best results across multiple hallucination benchmarks, reducing hallucination rates by 40%-60% across mainstream backbones while preserving general multimodal capabilities. Our work provides an attention-based perspective on mitigating hallucinations by exploring the model’s internal text-to-image cross-attention behaviors. Code is available at this https URL
[NLP-58] A Semantic-Layer-Mediated Agent for Natural Language to SQL over Heterogeneous Enterprise Databases
【速读】: 该论文旨在解决在真实企业级数据库上进行自然语言到SQL(NL2SQL)转换的挑战问题,此类场景因复杂的物理模式(包含数百个命名晦涩的表)、异构的SQL方言以及需要嵌套聚合、时间推理和多表连接的复杂分析任务而显著困难。其解决方案的关键在于提出一种基于语义层(semantic layer)的中介式NL2SQL代理系统,通过将语义意图与底层物理SQL执行解耦,避免直接在原始模式上生成SQL。该系统采用一种紧凑的中间表示——语义模型查询(Semantic Model Query, SMQ),使代理在经过精心设计的语义层上进行推理;随后,一个确定性编译器将每个SMQ转化为特定数据库方言的SQL,提供可验证的构建模块,由代理组合成最终查询。该系统采用受限的“思考-行动”循环架构,支持SQLite、BigQuery和Snowflake后端,并集成于端到端评估框架中。实验表明,基于Gemini 3 Pro的系统在Spider2-snow基准的547个任务上达到94.15%的执行准确率,在官方排行榜中位列第三,显著优于仅依赖模式信息的方法。研究还深入探讨了语义层质量对性能的影响,以及增强语义锚定(grounding)与过拟合之间的权衡。
链接: https://arxiv.org/abs/2606.31041
作者: Ha Jeong Kim,Saksonita Khoeurn,Ye Ji Yoon
机构: Chungbuk National University (忠北大学)
类目: Computation and Language (cs.CL)
备注: Submitted to FITAT 2026 for peer review
Abstract:Natural language-to-SQL (NL2SQL) over real-world enterprise databases remains significantly more challenging than on academic benchmarks. Enterprise schemas often contain hundreds of physical tables with cryptic column names, heterogeneous SQL dialects, and complex analytical workloads requiring nested aggregations, temporal reasoning, and multi-table joins. We present a semantic-layer-mediated NL2SQL agent that decouples semantic intent from physical SQL execution. Rather than generating SQL directly over raw schemas, the agent reasons over a curated semantic layer through a compact intermediate representation called the Semantic Model Query (SMQ). A deterministic compiler translates each SMQ into dialect-specific SQL, providing verified building blocks that the agent composes into the final query. The system employs a constrained think-act loop, supports SQLite, BigQuery, and Snowflake backends, and is integrated into an end-to-end evaluation framework. Using Gemini 3 Pro, the system achieves 94.15% execution accuracy on the 547-task Spider2-snow benchmark, ranking third on the official leaderboard and substantially outperforming schema-only approaches. We describe the system architecture, SMQ representation, agent workflow, evaluation results, and discuss semantic-layer quality and the trade-off between improved grounding and overfitting.
[NLP-59] ruth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对逻辑谬误(logical fallacies)等操纵性语言模式时的鲁棒性不足问题。尽管现有研究多聚焦于模型识别或分类谬误的能力,但其在持续对抗性说服下的稳定性仍缺乏系统评估。为此,作者提出LoFa(Logical Fallacy)基准,通过多智能体(multi-agent)生成流程将事实性问题与谬误论据配对,并引入多轮辩论框架以检验模型在持续对抗性攻击下的表现。为区分谬误鲁棒性与模型固有知识局限性,进一步提出LFR@k(Logical Fallacy Resistance at k)指标,量化模型对谬误攻击的抵抗能力。实验结果表明,不同模型在各类谬误中表现出显著差异的脆弱性特征,揭示了其鲁棒性分布的异质性。该研究的关键在于构建兼具生成性与评估性的综合基准体系,并设计可分离知识与推理能力的鲁棒性度量方法。
链接: https://arxiv.org/abs/2606.31039
作者: Xudong Shen,Li Yuan,Ye Chen,Xin Wu,Yi Cai,Zhiyong Wu
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); South China University of Technology (华南理工大学)
类目: Computation and Language (cs.CL)
备注: Accepted to ACL 2026 Main. 33 pages (9 pages main text)
Abstract:Large Language Models (LLMs) exhibit strong semantic capabilities, yet their resilience to manipulative linguistic patterns such as logical fallacies remains underexplored. Prior work has primarily examined whether LLMs can identify or classify fallacies, leaving their robustness against fallacious persuasion insufficiently studied. To address this gap, we introduce LoFa (Logical Fallacy), a comprehensive benchmark for evaluating LLM robustness against fallacies. LoFa is constructed through a multi-agent pipeline that pairs factual questions with fallacious arguments, and is accompanied by a multi-round debate framework for assessing model resilience under sustained adversarial persuasion. To disentangle fallacy robustness from a model’s inherent knowledge limitations, we further propose Logical Fallacy Resistance at k (LFR@k), a metric that quantifies resistance to fallacious attacks. Experiments show that LLMs exhibit varying levels of robustness across different fallacy types, revealing distinct vulnerability profiles among models.
[NLP-60] CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations
【速读】: 该论文旨在解决长文本生成中检索增强生成(Retrieval-Augmented Generation, RAG)系统产生的局部化幻觉(hallucination)难以精准定位的问题。现有方法多在句子或段落层面进行检测,无法实现对幻觉内容的细粒度识别。CORTEX提出一种基于标记级(token-level)的幻觉检测方法,其核心在于:真正基于检索文档的内容在生成过程中应受到文档更强的影响,而幻觉内容则不具备这种依赖性。为此,CORTEX通过对比大语言模型(Large Language Model, LLM)在有无检索文档输入下的内部表示差异,捕捉文档对生成过程的诱导效应;同时引入前序标记间信息传播机制,缓解因已有证据被上下文吸收而导致的误报问题;最后通过后处理平滑步骤建模幻觉标签在连续片段中的持续性,抑制局部噪声并提升预测的一致性。实验结果表明,CORTEX在两个RAG基准和三种LLM上均显著提升了标记级幻觉检测性能,且各组件均贡献稳定增益。
链接: https://arxiv.org/abs/2606.31033
作者: Kazuaki Furumai,Shuichiro Haruta,Kazunori Matsumoto,Daisuke Kamisaka
机构: KDDI Research, Inc.
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we propose CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation (RAG). In long-form RAG outputs, hallucinations often arise in localized spans rather than throughout an entire response. CORTEX therefore identifies ungrounded content at the token level, enabling fine-grained localization of hallucinations. The key intuition behind CORTEX is that tokens grounded in retrieved documents should be more strongly influenced by those documents than hallucinated tokens. To capture this document-induced effect, CORTEX compares internal representations of a large language model (LLM) under two conditions: with and without the retrieved documents. Instead of relying solely on each token’s immediate sensitivity to the retrieved documents, CORTEX also leverages the propagation of document-grounded information through preceding tokens, reducing false positives for tokens whose evidence has already been absorbed into the context. Finally, CORTEX applies post-processing smoothing step that models the tendency of hallucination labels to persist over contiguous spans, reducing local noise and encouraging span-consistent predictions. Experiments on two RAG benchmarks and three LLMs show that CORTEX substantially improves token-level hallucination detection, with each component consistently contributing to performance gains.
[NLP-61] Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization
【速读】: 该论文旨在解决自然语言到Lean形式化过程中形式化陈述的忠实性(faithfulness)评估难题,即现有基于编译通过率(compile-rate)的评估方法无法有效识别形式化结果在语义上是否忠实于原始自然语言命题,因为一个形式化声明可能通过类型检查但遗漏假设、改变定义域或表达平凡命题。其解决方案的关键在于提出一种结合Lean编译、跨模型语义判断与人工专家校准的多阶段评估协议,以构建一个更可靠的“共识忠实性”(consensus faithfulness)指标。实验表明,尽管工具增强型代理的编译通过率达89.5%,但其共识忠实性仅为60.5%,暴露出29.0个百分点的“编译通过但语义不忠实”差距。通过案例级人工审计验证,该指标具有保守决策边界:96.0%的共识正例被人工确认为忠实,而82.4%的编译通过但共识负例被确认为语义失败。进一步采用2³因子设计分解形式化流程中的三项关键干预措施——参数化专家起草、Mathlib/上下文检索和Lean求解反馈——发现:求解反馈对提升形式有效性影响最大,但亦加剧语义失败风险;上下文检索主要提升形式化接地性与选择性;而微调后的起草模块在具备反馈与接地能力后基本可被替代。研究强调,形式有效性、面向证明的Lean能力与忠实陈述生成应分别报告,以避免误导性性能评估。
链接: https://arxiv.org/abs/2606.31002
作者: Ke Zhang,Patricio Gallardo Candela,Sudhir Murthy,Yi Xie,Zhi Wang,Maziar Raissi
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: 25 pages, 5 figures
Abstract:Theorem-proving benchmarks evaluate proof search against fixed formal statements, but natural-language-to-Lean formalization must generate the formal statement itself. In this setting, compilation is only a validity check: a Lean declaration may type-check while omitting hypotheses, changing domains, or expressing a vacuous claim. We study faithful statement formalization as both an evaluation problem and a bottleneck-attribution problem. On a 400-entry graduate-level benchmark spanning real analysis, complex analysis, topology, and algebra, our protocol combines Lean compilation, cross-model semantic judging, and human expert calibration. The resulting picture is different from compile-rate evaluation: a full tool-augmented agent reaches 89.5% compilation but only 60.5% consensus faithfulness, exposing a 29.0-point compile-pass but consensus-unfaithful gap. Targeted human audits support the metric as a conservative decision boundary: across available case-level audits, 96.0% of consensus-positive outputs are human-confirmed faithful, while 82.4% of compile-pass consensus-negative outputs are human-confirmed semantic failures. Under this metric, existing one-shot formalizer models and prover-oriented Lean models remain low, suggesting that formal validity, proof-oriented Lean competence, and faithful statement generation should be reported separately. We then use a full 2^3 factorial design to decompose three recurring interventions in formalization pipelines: parametric expert drafting, Mathlib/context search, and Lean elaboration feedback. Elaboration feedback is the largest validity intervention, but it also exposes a larger compile-pass semantic-failure bucket; search mainly improves grounding and selectivity; and fine-tuned drafting is largely substitutable in this tool stack once feedback and grounding are available.
[NLP-62] Wait am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中仍存在的社会偏见问题,尤其聚焦于一种名为“演绎式刻板印象”(deductive stereotyping)的失败模式。该现象表现为模型将群体层面的统计规律错误地应用于个体案例,生成逻辑上自洽但具有社会偏见的推论。其核心问题在于:尽管现代LLM的推理能力有所提升,但在缺乏对公平性敏感的约束下,仍会系统性地放大隐含于训练数据中的偏见。解决方案的关键在于提出一种推理时注入(reasoning-time injection)框架,通过在推理阶段动态引入特定的公平性提示语(injection phrases)来引导模型进行更公平的推理。进一步地,作者提出Fair-GCG方法,以系统化方式发现高效且泛化的注入短语。实验表明,这些由Fair-GCG发现的注入短语不仅能显著提升多个公平性基准上的表现,还具备跨模型规模的可迁移性,有效增强模型在推理层级的公平性,降低开放式生成中的偏见,并成功应用于现实世界中对公平性敏感的任务。
链接: https://arxiv.org/abs/2606.30989
作者: Naihao Deng,Yilun Zhu,Joan Nwatu,Clayton Scott,Rada Mihalcea
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Warning: This paper contains several toxic and offensive statements. While reasoning generally improves fairness in recent large language models (LLMs), failures persist. In this work, we identify a failure mode, deductive stereotyping, in which models apply population-level statistical regularities to individual cases, producing logically coherent yet socially biased inferences. We provide a statistical interpretation of this phenomenon. To steer models toward fairness-aware reasoning, we propose a reasoning-time injection framework. We further introduce Fair-GCG to systematically discover effective injection phrases. Injection phrases discovered by Fair-GCG improve performance across multiple fairness benchmarks, generalize from smaller to larger LLMs, improves reasoning-level fairness, reduces bias in open-ended generation, and transfer to real-world fairness-sensitive tasks.
[NLP-63] Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments
【速读】: 该论文旨在解决在大规模场景下难以有效衡量专家判断中解释质量的问题。现有方法在处理海量自然语言解释时缺乏可扩展性和客观性,而传统的文本分析手段又无法充分捕捉解释背后的推理逻辑。为此,作者提出解释质量标记(Explanation Quality Markers, EQMs),这是一套基于理论指导的60种推理模式,通过大语言模型(Large Language Models, LLMs)进行自动化评分。其核心解决方案在于利用LLM对自然语言解释中的认知结构进行系统性识别与量化,从而构建可量化的解释质量指标。实证研究表明,在超过5.5万条来自多年预测竞赛的预测-解释配对数据中,EQMs在预测个体预测准确率方面表现优异,不仅在预测层级上优于传统文本分析方法,且在预测者层级也具有较强竞争力;更重要的是,其信号具有不对称性——更擅长识别潜在低绩效者而非顶尖预测者。此外,相较于人类对解释质量的主观评价,EQMs与准确性的相关性更为稳定,且不受解释长度的干扰。研究结果在独立预测研究中亦得到验证,表明EQMs是一种可扩展、可解释且有效的从书面解释中提取判断相关特征的方法。
链接: https://arxiv.org/abs/2606.30987
作者: Christopher W. Karvetski,Sheldon S. Huang,Simas Kučinskas,Nadja Flechner,Jingyu Hu,Philip Tetlock,Ezra Karger
机构: Forecasting Research Institute; University of Toronto; Vector Institute for Artificial Intelligence; Stanford University; Federal Reserve Bank of Chicago
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注:
Abstract:Decision-makers routinely rely on expert judgments accompanied by written explanations, yet explanation quality is difficult to measure at scale. Forecasting tournaments offer a natural testing ground: probabilistic judgments are paired with natural-language rationales and scored against realized outcomes. We introduce Explanation Quality Markers (EQMs), a set of sixty theory-guided reasoning patterns scored by large language models (LLMs). In a pre-registered analysis of over 55,000 forecast-rationale pairs from a multiyear forecasting tournament, EQMs predict accuracy at both the forecast and forecaster levels, consistently outperforming pre-LLM text-analysis methods. More than 90% of statistically significant pattern-level EQM-accuracy correlations match our directional hypotheses. The signal is asymmetric: EQMs identify likely underperformers more reliably than they distinguish the very best forecasters. Benchmarked against traditional indicators of forecasting skill, EQMs are the strongest predictor at the forecast level and competitive at the forecaster level, though weaker than prior accuracy. Human ratings of rationale quality are less consistently correlated with accuracy and place disproportionate weight on rationale length. Results transfer to an independent forecasting study. EQMs provide a scalable, interpretable method for extracting judgment-relevant information from written explanations.
[NLP-64] From Propositional to Perceptual Asymmetry: Extending Frictive Policy Optimization to Asymmetric Partial Information Dialogue SIGDIAL2026
【速读】: 该论文旨在解决协作对话中因感知不对称(perceptual asymmetry)导致的指称理解失败问题,即对话双方在信息状态上存在部分不对称,同一指称表达在不同参与者的信息背景下可能产生不同的指称结果。传统基于摩擦的策略优化(FPO)仅适用于命题不对称(propositional asymmetry)场景,即双方共享相同感知情境但对命题解释存在差异,而本文将其扩展至更复杂的感知不对称情形。其解决方案的关键在于:重新定义摩擦作为认知信号的有效性需基于个体的信息视域(information horizon)进行评估,并揭示在不同地标配置下会引发定性不同的共知构建失败模式;特别地,少数模糊配置通过看似成功但隐性偏离的对话轨迹主导了大部分误解。此外,通过大语言模型(LLM)探针验证,拥有“正确视角”比掌握全部视角更具优势,表明单一知情视角优于全知访问双方面信息。为此,作者提出两项标注改进:对未完成共知状态进行子类型分解,以及引入考虑适应性的对齐分类机制,以提升对话系统在复杂不对称情境下的指称鲁棒性。
链接: https://arxiv.org/abs/2606.30973
作者: Yifan Zhu,Kyeongmin Rim,James Pustejovsky
机构: Brandeis University (布兰迪斯大学)
类目: Computation and Language (cs.CL)
备注: 11 pages. To appear in Proceedings of SIGDIAL 2026
Abstract:Frictive Policy Optimization (FPO; Pustejovsky et al., 2025) treats friction in collaborative dialogue – misalignment, misunderstanding, repair – as an epistemic signal essential to common-ground construction, rather than noise to be minimized. However, FPO and its implementations assume shared perceptual contexts, where friction arises from differently interpreted propositions over the same scene, which we define as propositional asymmetry. We extend FPO to perceptual asymmetry, where participants hold asymmetric partial information and the same referring expression yields different denotations depending on whose information state grounds the reference. We evaluate this through cross-corpora analysis and LLM probing on referentially asymmetric dialogue tasks, primarily the HCRC MapTask (Anderson et al., 1991). We find that FPO’s friction functional is empirically valid only when evaluated from within each participant’s information horizon: different landmark configurations produce qualitatively distinct grounding failure modes, with a small class of ambiguous configurations driving a disproportionate share of misunderstandings through trajectories that appear successful but silently diverge. The LLM probe confirms that having the “right perspective” matters more than having all perspectives: the informed single viewpoint outperforms omniscient access to both participants’ contexts. We propose two annotation refinements: subtype decomposition of pending grounding states and accommodation-aware alignment classification.
[NLP-65] Linguistic Distancing on Social Media: Indicators of Emotion Regulation Across Age Groups
【速读】: 该论文旨在解决情绪调节(emotion regulation)随年龄变化的机制问题,特别是探究个体在不同生命阶段如何通过语言表达实现心理距离(psychological distancing)以调节情绪。其核心解决方案在于构建并验证一种可量化的语言距离化(linguistic distancing)指标,通过分析大规模社交媒体文本数据,系统考察该指标在不同年龄群体中的演变规律。研究发现,语言距离化现象随年龄增长呈显著增加趋势,这一结果为心理距离与情绪调节能力正相关的理论提供了实证支持,并揭示了语言特征作为心理健康状态生物标志物的潜力。该研究为未来基于文本数据开展情绪调节机制研究及开发精准健康干预策略奠定了方法学基础。
链接: https://arxiv.org/abs/2606.30957
作者: Daniela Teodorescu,Saif M. Mohammad,Alona Fyshe
机构: 未知
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures, Computational Affective Science Workshop
Abstract:Managing our emotional responses to events is key to emotional well-being, a process referred to as emotion regulation in psychology. Previous work has established that the degree to which we distance events is a type of emotion regulation. When we psychologically distance from events there can be markers in our language. These markers have been referred to as linguistic distancing. We build upon a previous metric to operationalize linguistic distancing, and explore how it changes across the lifespan. We explore this systematically by analyzing large amounts of social media text, a venue where people express their emotions. By investigating how distancing varies across age groups we can better understand how emotion regulation varies with age and provide initial benchmarks on social media data. We provide additional evidence further strengthening the hypothesis that linguistic distancing occurs in proportionally more instances with age. These findings align with past work in psychology which indicate improved well-being with older age. Better understanding how linguistic distancing changes with age is important because it functions as a marker of well-being and can inform effective health interventions. We provide a foundation for further exploring emotion regulation through linguistic distancing in text data.
[NLP-66] Bridging Scientific Heritage: An Arabic–Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer
【速读】: 该论文旨在解决阿拉伯语与俄语之间科学文献翻译的语言障碍问题,以促进两大语言社群在可持续发展相关研究领域的知识共享与国际合作。其核心挑战在于跨语言科学文本的准确、专业翻译,尤其在缺乏高质量平行语料和领域适配模型的情况下。解决方案的关键在于构建一个包含约2.7万句对的混合平行语料库(hybrid parallel corpus),涵盖科学摘要与通用领域文本(如宗教、新闻、对话),并基于多语言大模型(mT5-base、NLLB-200-distilled-1.3B、Qwen2.5-7B-Instruct)采用LoRA(Low-Rank Adaptation)进行高效微调。实验表明,采用QLoRA(rank 8)的Qwen2.5-7B模型在多项指标上显著优于零样本基线,其中BLEU提升4.36,COMET提升0.051;而少量示例提示(few-shot prompting)效果不佳,凸显了领域特定微调的必要性。研究成果包括开源的模型、语料库与评估代码,有效降低了科学文本的跨语言交流壁垒,支持联合国可持续发展目标(SDG 9与SDG 17)下的技术驱动型可持续发展。
链接: https://arxiv.org/abs/2606.30943
作者: M. K. Arabov
机构: 未知
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:Russian and Arabic are among the major languages of scientific communication. Language barriers impede the exchange of research results between these communities, which affects international collaboration and the progress of sustainability-related research. We present a benchmark for Arabic–Russian scientific translation. The benchmark includes a hybrid parallel corpus of about 27,000 sentence pairs, compiled from scientific abstracts and general-domain texts (religion, news, conversations). We fine-tune three multilingual language models – mT5-base (580M parameters), NLLB-200-distilled-1.3B (1.3B), and Qwen2.5-7B-Instruct (7B) – using LoRA with ranks 8, 16, 32, and 64. The Qwen2.5-7B model with QLoRA (rank 8) yields BLEU 23.15, chrF 43.89, BERTScore 0.906, and COMET 0.758. These are +4.36 BLEU and +0.051 COMET above the zero-shot baseline. Few-shot prompting with three examples does not improve performance, indicating that domain-specific fine-tuning is required. We release the models, the corpus, and the evaluation code. By lowering the language barrier for scientific texts, the work enables knowledge exchange between Arabic-speaking and Russian-speaking researchers. It contributes to sustainable partnerships (UN SDG 17) and innovation infrastructure (SDG 9), aligning with the conference’s focus on technology-driven sustainable development.
[NLP-67] Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text
【速读】: 该论文旨在解决生成式 AI 在低资源语言(如孟加拉语)中事件检测(Event Detection, ED)任务上对现实世界噪声数据鲁棒性不足的问题。现有评估多基于干净、经过清洗的文本,难以反映模型在真实场景下的表现,尤其是在自动语音识别(ASR)转录文本和拼写错误文本等噪声环境下的性能。为此,研究提出了一套通用的孟加拉语新闻事件本体,并构建了一个包含9,979条标注句子的基准数据集,覆盖40种事件子类型及三种不同噪声水平的数据源(干净新闻文本、真实世界ASR转录文本、拼写错误文本)。关键解决方案在于系统性对比编码器架构(BanglaBERT、XLM-R)与解码器架构(Llama 3、Gemma 3)大语言模型在噪声条件下的表现差异,揭示了架构层面的根本权衡:编码器模型在干净文本上表现更优,但对噪声敏感;而解码器仅有的大语言模型则展现出显著更强的鲁棒性,尤其在事件触发词受损时仍能保持较高性能。此外,通过在指令微调中嵌入注释指南可建立噪声文本上的更高基线性能,但其对不同噪声类型的降级缓解效果不一致。最后,模型规模扩展显著提升解码器模型的鲁棒性,而联合训练清洁与噪声数据作为正则化策略,对编码器架构具有显著增益,有效缩小了两类模型间的鲁棒性差距。
链接: https://arxiv.org/abs/2606.30914
作者: Tanvir Ahmed Sijan,S. M Golam Rifat,Nayeemul Islam,Md. Musfique Anwar
机构: Jahangirnagar University (乔拉吉纳加尔大学); Rajshahi University of Engineering & Technology (拉杰沙希工程与技术大学); Bangladesh University of Engineering and Technology (孟加拉国工程技术大学)
类目: Computation and Language (cs.CL)
备注: 17 pages, 8 figures
Abstract:Event detection (ED) systems are typically evaluated on clean, curated text, leaving their robustness to real-world noise largely unexplored, particularly for low-resource languages such as Bangla. We introduce a generalized Bangla news event ontology and a benchmark comprising 9,979 annotated sentences across 40 event subtypes, spanning clean news text, real-world Automatic Speech Recognition (ASR) transcripts, and orthographically corrupted text. We systematically evaluate fine-tuned encoder-only models (BanglaBERT and XLM-R) alongside instruction-tuned decoder-only large language models (Llama 3 and Gemma 3). Our results reveal a clear architectural trade-off: encoder models achieve higher performance on clean text but degrade substantially under noise, whereas decoder-only LLMs are markedly more robust, particularly when event triggers are corrupted. We further show that embedding annotation guidelines during instruction tuning establishes a higher performance baseline on noisy text but yields inconsistent reductions in performance degradation across noisy conditions. Finally, model scaling consistently improves the robustness of decoder-only LLMs, while combined training on clean and noisy data serves as an effective regularization strategy that disproportionately benefits encoder architectures, significantly narrowing the robustness gap.
[NLP-68] Multilingual Polarization Detection Using Transformer-Based Models with Class Weighting and Threshold Tuning
【速读】: 该论文旨在解决多语言、多文化及多事件在线极化检测中的复杂挑战,针对SemEval-2026任务9的三个子任务——二元极化检测、极化类型分类以及极化表现形式识别——提出统一解决方案。其核心问题在于跨语言场景下标签严重不平衡与多标签分类的精度不足,尤其在低频极化现象(如去人性化、缺乏同理心)上的识别能力薄弱。解决方案的关键在于:采用基于Transformer的预训练模型(英语使用RoBERTa-base,斯瓦希里语使用AfroXLMR-base),结合类别加权损失函数以缓解标签分布不均的影响,并通过逐标签阈值调优优化多标签分类性能。实验结果表明,该方法在测试集上取得了具有竞争力的F1宏平均得分,尤其在斯瓦希里语任务中表现突出,验证了其对不平衡多标签极化检测的有效性。然而,误差分析显示,模型在去人性化和缺乏同理心等隐性极化表现形式的识别上仍存在明显局限,提示未来需进一步增强对深层社会情感语义的理解能力。
链接: https://arxiv.org/abs/2606.30857
作者: Aaron Bundi Anampiu
机构: African Institute for Mathematical Sciences, South Africa
类目: Computation and Language (cs.CL)
备注:
Abstract:This paper describes our submission to SemEval-2026 Task 9 on detecting multilingual, multicultural, and multievent online polarization. We address all three subtasks: binary polarization detection, polarization type classification, and manifestation identification for English and Swahili. Our approach leverages transformer-based models (RoBERTa-base for English, AfroXLMR-base for Swahili) with class-weighted loss functions to address severe label imbalance and per-label threshold tuning to optimize multi-label classification. On the test set, we achieve F1 macro scores of 0.7901 (English) and 0.7910 (Swahili) for Subtask 1, 0.4615 (English) and 0.4808 (Swahili) for Subtask 2 and 0.4791 (English) and 0.5830 (Swahili) for Subtask 3, which give competitive performance on the leaderboard, demonstrating the effectiveness of our methods for handling imbalanced multi-label polarization detection. Our error analysis reveals that models struggle with dehumanization detection and lack of empathy.
[NLP-69] When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models
【速读】: 该论文旨在解决生成式推理模型(Generative Reasoning Models)在不同推理实例中计算资源分配不均的问题,核心关注点在于:在固定计算预算下,基于学习的停止规则(learned stopping rule)是否能有效超越传统的置信度或收敛性阈值等单一标量停止策略。其解决方案的关键在于提出一种无需依赖隐藏状态的检查点停止器 LearnStop,通过在固定预算检查点处对当前推理前缀(reasoning prefix)进行短答案探测,并利用在线特征(如答案置信度、熵、前缀投票占比、答案稳定性及回溯标记密度)预测前缀正确性,从而实现动态决策。研究结果表明,学习型停止策略的有效性具有任务依赖性:在自由形式数学题(如GSM8K与Qwen3-32B)中,多特征融合的停止规则显著优于固定预算基线,实证达到+0.157的后验增益,且在验证集选择的操作点上仍保持正向收益;而在多项选择题或极难题目中,单标量规则(如置信度、熵或稳定性)更具竞争力。因此,论文指出学习型停止并非对单标量退出的通用替代方案,而是一种依赖于推理轨迹结构的工具——当多个问题在未耗尽预算前即已趋于正确,但缺乏可靠单一标量信号时,其优势最为明显;一旦置信度或答案收敛性已能有效解决停止问题,学习型停止的增益则基本消失。研究进一步提供了验证选择的操作点、配对自举检验、有限网格下的误判风险校准、不同内存管理场景(KV-fork、前缀缓存、黑盒)下的成本核算、H100服务性能分析、检查点调度扫描、迁移性分析与鲁棒性验证,系统性支持了上述结论。
链接: https://arxiv.org/abs/2606.30852
作者: Zhe Dong(University of Maine at Presque Isle),Fang Qin(Stanford University),Manish Shah(Independent Researcher)
机构: University of Maine at Presque Isle; Stanford University
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 5 figures
Abstract:Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with LearnStop, a hidden-state-free checkpoint stopper for reasoning language models. At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking-marker density. Across 18 task-model settings spanning GSM8K, MATH-500, MMLU-Pro, AIME-90, GPQA, Qwen3, and DeepSeek-R1 distillations, the answer is task-dependent. On free-form math, learned multi-feature stopping improves the fixed-budget frontier and often beats scalar exits: on GSM8K with Qwen3-32B, the empirical frontier reaches a post-hoc peak adapt gain of +0.157, validation-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is +0.028. On multiple-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger. We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure. We further provide validation-selected operating points, paired bootstrap tests, finite-grid lost-correct risk calibration, cost accounting under KV-fork, prefix-cache, and black-box regimes, H100 serving profiles, checkpoint-schedule sweeps, transfer analyses, and robustness checks. The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem.
[NLP-70] st-Time Verification for Text-to-SQL via Outcome Reward Models ACL2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段于结构化推理任务(如Text-to-SQL)中可靠性不足的问题。现有测试时推理策略(如Best-of-N采样和多数投票)依赖执行成功或输出频率等启发式信号,难以在语义层面有效区分候选输出。其解决方案的关键在于引入结果奖励模型(Outcome Reward Models, ORMs),作为可学习的语义评分函数,用于测试时验证。论文提出GradeSQL框架,通过自动化生成候选答案并基于执行结果进行标签标注,实现无需人工注释的任务特定ORM训练。将ORM集成至验证驱动的Best-of-N管道后,在BIRD与Spider基准上均显著优于传统方法,最大提升达+4.33%(BIRD)和+2.10%(Spider),且在复杂查询和更大候选集下表现更优,证明了ORM-based验证是一种简单、高效且可扩展的替代方案。
链接: https://arxiv.org/abs/2606.30851
作者: Mattia Tritto,Giuseppe Farano,Dario Di Palma,Gaetano Rossiello,Fedelucio Narducci,Dharmashankar Subramanian,Tommaso Di Noia
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
备注: Accepted to the SURGeLLM Workshop at ACL 2026, San Diego, US
Abstract:Improving the reliability of large language models (LLMs) at inference time is a central challenge in structured reasoning tasks such as Text-to-SQL. Common test-time inference strategies, including Best-of-N sampling and Majority Voting, rely on heuristic signals such as execution success or output frequency, which provide limited semantic discrimination across candidate outputs. In this work, we study Outcome Reward Models (ORMs) as learned semantic scoring functions for test-time verification in Text-to-SQL. While ORMs have been previously explored for test-time scaling and alignment, their application to structured query generation remains underexplored. We introduce GradeSQL, a scalable framework for training task-specific ORMs via automated candidate generation and execution-based labeling, enabling verifier training without manual annotation. We integrate ORMs into a verification-driven Best-of-N pipeline and evaluate our approach on the BIRD and Spider benchmarks across multiple open-source LLM families. ORM-based selection consistently outperforms execution-based Best-of-N and Majority Voting, with gains of up to +4.33% on BIRD and +2.10% on Spider. We further show that ORMs scale effectively with larger candidate sets and yield stronger improvements on complex queries. Overall, our results demonstrate that ORM-based verification provides a simple, effective, and scalable alternative to heuristic test-time selection strategies for Text-to-SQL. Code datasets and models are publicly available.
[NLP-71] When transformers learn “impossible” languages what do they learn? CONLL2026
【速读】: 该论文旨在解决生成式语言模型为何无法习得“不可能语言”(impossible languages)这一现象背后的机制问题。现有研究多基于样本效率与测试集困惑度的差异推断模型对人类语言的偏好,但缺乏对可解释人类语言非可及性的语言能力维度的直接评估。本文提出并检验两个理论驱动的关联假设:一是语法敏感性缺陷,二是生成生产能力不足。通过在经扰动的“不可能”英语变体上训练类似GPT-2的模型,研究发现模型在使用BLiMP最小对立对评估语法敏感性时仅表现出渐进式性能退化,且该退化受语言信息局部性的影响;而在生成任务中,模型在长句生成时则出现显著失败,生成高质量句子的能力大幅下降。结果表明,生成能力缺陷与语言传播失败是连接语言模型行为与人类无法习得不可能语言现象的合理解释。
链接: https://arxiv.org/abs/2606.30815
作者: Ram Janarthan,Coleman Haley,Sharon Goldwater
机构: University of Edinburgh (爱丁堡大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: CoNLL 2026 (Best Paper Award). 14 pages, 3 figures
Abstract:Recent work suggests that transformer language models show a bias towards human languages over unnatural (“impossible”) languages argued to be unacquirable by humans. However, this literature has largely based these claims on differences in sample efficiency and test-set perplexity, rather than on direct evaluations of the linguistic capacities that could plausibly explain non-attestation in human languages. We evaluate two theoretically motivated linking hypotheses: impossibility arising from deficiencies in grammatical sensitivity or generative production. Using GPT-2 style models trained on perturbed “impossible” variants of English, we measure sensitivity to grammaticality using BLiMP minimal pairs, finding that model performance exhibits only gradual degradation, mediated by the language’s information locality. In contrast, these models exhibited pronounced failures in generation, producing substantially fewer high-quality sentences at longer lengths. Together, these results suggest generative deficiency and transmission failures as a plausible linking hypothesis between language model behaviour and non-attestation of impossible languages.
[NLP-72] When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLM s
【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)校准评估中因模型准确率差异导致的不公平比较问题。传统方法依赖全局校准度量(如期望校准误差 Expected Calibration Error 和 Brier Score),但这些指标在不同模型间进行比较时会受到模型自身准确率差异的混淆,从而产生误导性结论。论文提出了一种名为ACE(Accuracy-Controlled Evaluation)的准确性控制评估框架,其核心创新在于引入三种互补视角:实例对齐(Instance-Aligned)、分布对齐(Distribution-Aligned)与候选对齐(Candidate-Aligned)校准评估,以在控制准确率的前提下实现更公平的跨模型校准比较。通过在多个基准测试、模型族及置信度生成方法上的实证分析,研究发现许多先前基于原始全局指标观察到的校准优势在控制准确率后显著减弱,且频繁出现排名反转现象,即原本表现优异的模型在控制准确率后不再占据优势。研究结果表明,原始全局校准度量不具备跨模型比较的鲁棒性,而公平的校准评估必须采用考虑准确率影响的感知型评估方法。
链接: https://arxiv.org/abs/2606.30814
作者: Zhichao Yang,Caiqi Zhang,Ruihan Yang,Chengzu Li,Nigel Collier,Deqing Yang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Calibration evaluates whether a model confidence aligns with its empirical accuracy. Existing studies often compare the calibration of different large language models using global calibration metrics such as Expected Calibration Error and Brier Score. We begin by showing, both theoretically and empirically, that such comparisons are confounded by differences in model accuracy. For fairer cross-model comparison, we then propose ACE, an accuracy-controlled evaluation framework with three complementary views: Instance-Aligned, Distribution-Aligned, and Candidate-Aligned calibration. Across multiple benchmarks, model families, and confidence elicitation methods, we use ACE to study two practically important comparison axes, small versus large models and thinking versus non-thinking models. We find that many previously reported calibration advantages under raw global metrics weaken substantially after accuracy control. We also find that ranking reversal is frequent: models favored by raw metrics often cease to be favored once accuracy is controlled. Our results show that raw global calibration metrics are not robust for cross-model comparison, and that fair calibration comparison requires accuracy-aware evaluation.
[NLP-73] Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale
【速读】: 该论文旨在解决在线平台个性化推荐算法的黑箱审计难题,其核心挑战在于:现有审计方法在真实用户行为捕捉与可控性之间存在权衡——真实用户研究虽具生态效度但成本高、难控制,而基于脚本的傀儡账户(sock-puppet)审计则因行为模式缺乏真实性而受限,且二者均难以有效分离用户属性与行为,从而阻碍对个性化机制的因果理解。为此,论文提出一种基于生成式 AI (Generative AI) 代理的黑箱审计框架,利用具有固定人格特征(基于人口统计与政治调查数据构建)的合成账户代理作为行为引擎,使代理能够基于对内容的推理自主决策并交互。通过在平台可见信号(如年龄、性别、地理位置)上施加实验性扰动,该设计实现了对用户属性影响的反事实分析。以2024年美国大选后在X平台部署的1,120个代理为例,覆盖14种人格类型及三种反事实条件,共收集超20万次内容曝光数据,结果表明X平台的算法推荐流相较于时间线流显著放大了有毒、极化、政治性及右倾内容,且放大效应随用户意识形态剧烈变化;反事实分析进一步揭示,人口学信号的影响呈现显著的人格依赖性,总体平均效应接近零,但子群体层面的效果方向与强度差异显著。该研究确立了生成式AI代理作为算法审计新范式的可行性与有效性。
链接: https://arxiv.org/abs/2606.30801
作者: Alessandro Morosini,Sarah H. Cen,Andrew Ilyas,Hedi Driss,Aleksander Mądry,Chara Podimata
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
备注: 43 pages, 10 figures
Abstract:Personalization algorithms determine what content users encounter on online platforms. Auditing these systems is difficult because independent auditors have only black-box access to the algorithms, while personalization depends on users’ attributes, behavior, and evolving interaction histories. Existing auditing methods face a tradeoff: studies with real users capture realistic behavior but are costly and hard to control, whereas sock-puppet audits scale more easily but often rely on scripted behavior that limits realism. Beyond this, both approaches struggle to decouple user attributes from user behavior, limiting our ability to causally understand personalization. To address this gap, we introduce a framework for black-box audits of personalization algorithms using generative AI agents as behavioral engines for synthetic accounts. Each agent is instantiated with a fixed persona, grounded in demographic and political survey data, and interacts with a platform’s content by reasoning about it and choosing actions. Because behavior is fixed within each persona while platform-visible signals such as age, gender, or location can be experimentally perturbed, our design enables counterfactual auditing of how platforms respond to user attributes. As a case study, we deploy 1,120 agents on X shortly after the 2024 U.S. election, spanning 14 personas and three counterfactual conditions, collecting over 200,000 content exposures. We find that X’s algorithmic feed amplifies toxic, polarizing, political, and right-leaning content relative to the chronological feed, with amplification varying sharply by user ideology. Counterfactual analyses show that demographic signals affect content delivery in persona-dependent ways: pooled effects are largely null, while subgroup-level effects vary in direction and magnitude. Our work establishes GenAI-based agents as a new tool for algorithmic auditing.
[NLP-74] Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLM s on Romanized Indic-English Instructions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理印度语系罗马化代码混用(Indic Romanized Code-Mixed, Indi-RomCoM)文本时指令遵循与推理能力不足的问题。随着多语言社区中使用者在罗马字母书写系统下频繁混合本地语言与英语(即罗马化代码混用,RCM),现有主流LLMs在非原生书写系统及跨语言语境下的表现仍缺乏系统评估。为此,研究提出Indi-RomCoM基准,涵盖七类指令遵循任务、四种广泛使用的印地语系语言以及三种可控的代码混用强度层级,以实现对多种类型模型(包括商业闭源、开源及专注印地语系的模型)在零样本与少样本设置下的全面评估。研究发现,随着代码混用密度增加,所有模型性能均显著下降,且检测类任务(如毒性识别)比推理类任务受负面影响更严重,原因在于推理过程中生成的解释可提供必要上下文信息以缓解歧义。该研究的关键贡献在于构建了首个系统性评估多语言罗马化代码混用指令遵循能力的基准,并揭示了当前模型在真实世界多语言交互场景中的局限性,为开发更具包容性的多语言AI系统提供了重要依据。
链接: https://arxiv.org/abs/2606.30790
作者: Avisha Das,Mihir Parmar,Mohana Ramnath,Pulkit Verma
机构: Shiv Nadar University Chennai; Google Cloud AI Research; IIT Madras
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Romanized Code Mixing (RCM), where bilingual speakers fluidly blend local languages with English in Roman script, has emerged as the dominant form of communication across multilingual communities. While Large Language Models (LLMs) perform strongly on monolingual and native-script benchmarks, their ability to follow instructions and reason over RCM-based content remains largely unexplored. To this end, we introduce the Indi-RomCoM benchmark for facilitating systematic evaluation on Indic Romanized Code-Mixed instructions. Our benchmark spans seven instruction-following tasks, four widely spoken Indic languages, and three controlled code-mixing intensity levels. We extensively evaluate a suite of LLMs covering proprietary, open-weight, and Indic-focused models under zero- and few-shot settings. LLMs consistently underperform on RCM instructions, with performance degrading as code-mixing density increases. Furthermore, reasoning tasks suffer less degradation than detection tasks (e.g., Toxicity) because the generated explanations offer necessary context. We believe Indi-RomCoM helps the community in developing inclusive multilingual systems.
[NLP-75] Revocable Learned State via Process Sidecars
【速读】: 该论文旨在解决大语言模型在分阶段微调过程中,如何精确撤销记忆(memory)并保持安全对齐(safety alignment)的问题。具体而言,现有方法在完成安全训练后尝试“删除”记忆时,由于安全优化器已对记忆方向进行转移(transported),直接减去记忆更新(即使用任务算术,task arithmetic)无法准确恢复原始状态,导致第一阶反事实误差。其解决方案的关键是提出“过程侧车”(process sidecars)这一新型参数编辑框架,形式为 θ^(λ,γ)=θAMS−λΔM−γR^S←M,其中 R^S←M 为基于未来AdamW安全训练过程的中心差商估计,即 J^S,ε(ΔM)−ΔM,利用自然尺度下的 ε=1,复用 θAMS 作为正向端点,并仅需额外计算一次在 θA−ΔM 处的安全轨迹。理论证明表明:当使用真实传输方向 RS←M 时,该方法在 (λ,γ)=(1,1) 处可实现二阶精度的反事实安全最优解 θAS,且将AdamW视为参数、一阶矩与二阶矩构成的增广状态映射;更重要的是,该过程信息不可替代——任何标量任务算术编辑均存在第一阶误差,而过程侧车方法达到二阶精度。实验验证显示,在三个模型上,经验证集选择的二维编辑方案在所有测试中均优于朴素任务算术,且在所有配对试验中优于 γ=λ 的过程JVP子族。
链接: https://arxiv.org/abs/2606.30788
作者: John Sweeney
机构: Sideplane AI
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: 23 pages, 2 figures, 6 tables
Abstract:Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not the same problem as subtracting the memory update: the later safety optimizer has transported the memory direction. We introduce process sidecars, a two-coefficient edit family \hat\theta(\lambda,\gamma)=\theta_\mathrmAMS-\lambda\Delta_\mathrmM-\gamma\hatR_\mathrmS\leftarrow\mathrmM , with \hatR_\mathrmS\leftarrow\mathrmM=\hatJ_\mathrmS,\varepsilon(\Delta_\mathrmM)-\Delta_\mathrmM , where \hatJ_\mathrmS,\varepsilon is a centered secant through the realized future AdamW safety-training process. The implementation uses \varepsilon=1 at the natural memory-edit scale; it reuses \theta_\mathrmAMS as the positive endpoint and computes one additional safety trace at \theta_\mathrmA-\Delta_\mathrmM . We prove two things. First, the exact sidecar, using the true transported direction R_\mathrmS\leftarrow\mathrmM rather than the secant estimate, at (\lambda,\gamma)=(1,1) recovers the counterfactual safety-only oracle \theta_\mathrmAS up to second order; the proof treats AdamW as an augmented-state map over parameters, first moments, and second moments. Second, this process information is necessary: whenever future safety training bends the memory direction, every scalar task-arithmetic edit leaves first-order counterfactual error, while the process-sidecar edit is second-order accurate. Across three models, the validation-selected 2D edit improves held-out refusal closure over naive task arithmetic in all trials, and over the \gamma=\lambda process-JVP subfamily, the diagonal slice of the cached 2D grid, in all paired trials.
[NLP-76] A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization
【速读】: 该论文旨在解决企业级生成式 AI 代理在处理用户查询时因技能(skill)描述语义重叠导致的路由错误问题,即“技能碰撞”(skill collision)。当多个技能的自然语言描述存在重叠时,生成式 AI 路由模型(routing LLM)容易将用户请求错误地分配至不匹配的技能,严重影响系统准确性。随着技能数量增加,通过人工调整描述以维持路由精度成为难以持续的工程瓶颈。为此,论文提出并部署了一套自动化描述优化流水线,在一个包含9个技能、372个回归案例的生产环境群聊代理中验证:该流水线生成的技能描述平均达到79.2%的F1分数,与人工调优结果(79.4%)几乎无显著差异(平均单技能差异仅-0.20%,低于0.78%的多种子噪声阈值),同时将每个技能的工程耗时从120分钟降至3.8分钟,实现32倍加速。关键发现表明,流水线中最具影响力的组件是利用任意可用的误报(false-positive)和漏报(false-negative)案例进行一次大语言模型(LLM)重写,即可捕获绝大部分性能提升;其他设计因素(如迭代预算、反馈信号构成、混淆对双编辑策略及训练集规模)对最终F1的影响均小于0.5%。此外,研究识别出一种诊断指标——训练集与验证集间显著的F1差距,可有效标识出因技能本意范围真实重叠而无法通过文本优化解决的情况,提示需进行架构层面而非文本层面的干预。
链接: https://arxiv.org/abs/2606.30775
作者: Yangqiaoyu Zhou,Mohammad Alqudah,Kwei-Herng Lai,Aaron Halfaker,Yingqi Xiong,Yaar Harari
机构: Microsoft
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures
Abstract:Enterprise AI agents route user queries to specialized skills by matching queries against natural language skill descriptions. When two skills share overlapping descriptions, the routing LLM misroutes queries, a failure we term skill collision. As agents scale to dozens of skills, manually tuning descriptions to maintain routing accuracy becomes a significant engineering bottleneck. We deploy an automated description optimization pipeline on a production enterprise group chat agent (9 skills, 372 regression cases). The pipeline produces descriptions averaging 79.2% F1, matching manually tuned descriptions at 79.4% F1 (average per-skill difference -0.20%, within the 0.78% multi-seed noise floor), while reducing per-skill engineering effort from 120 minutes to 3.8 minutes (32 times speedup). We then examine which pipeline components actually drive this match. Systematic ablation on both the production system and ToolBench (16k tools) reveals that a single LLM rewrite using any available false-positive and false-negative cases captures most of the available improvement. Other design choices we tested (iteration budget, feedback signal composition, dual editing of confused pairs, and training set size) each affect final F1 by less than 0.5%. Description optimization addresses skill collisions caused by overlapping descriptions but cannot resolve cases where two skills intended scopes genuinely overlap. We identify a diagnostic (a large train-validation F1 gap) that flags the latter cases for architectural rather than text-level intervention.
[NLP-77] From Search to Synthesis: Training LLM s as Zero-Shot Workflow Generators
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在具体任务实例中生成的解决方案缺乏结构一致性的问题,这限制了其在实际部署中的可靠性与可复用性。现有方法虽通过工作流(workflow)编码任务级别的算法模式以提升鲁棒性、可解释性和复用性,但依赖人工设计,成本高昂且难以推广。当前自动工作流生成方法要么仅生成针对特定实例的解法而未能学习任务级通用模式,要么无法超越训练配置进行泛化。为此,本文提出MetaFlow,将工作流生成建模为元学习(meta-learning)问题:在给定任务与操作符集合的前提下,模型学习组合可泛化的解决方案策略。MetaFlow采用两阶段训练机制——首先在合成工作流数据上进行监督微调,随后通过可验证奖励的强化学习(Reinforcement Learning with Verifiable Rewards, RLVR),利用跨实例的任务执行反馈优化端到端成功率。该方法不仅在问答、代码生成和数学推理等基准任务上实现了与先进基线相当的单次推理性能,更展现出卓越的零样本泛化能力,能有效适应未见任务及新型操作符集合,显著提升了工作流生成的通用性与实用性。
链接: https://arxiv.org/abs/2606.30704
作者: Gan Luo,Zihan Qin,Bin Dong,Wotao Yin
机构: Peking University (北京大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 35 pages, 8 figures
Abstract:Large language models (LLMs) excel across a wide range of tasks, yet their instance-specific solutions often lack the structural consistency needed for reliable deployment. Workflows that encode recurring algorithmic patterns at the task level provide a principled framework, offering robustness across instance variations, interpretable traces for debugging, and reusability across problem instances. However, manually designing such workflows requires significant expertise and effort, limiting their broader application. While automatic workflow generation could address this bottleneck, existing methods either produce instance-specific solutions without learning task-level patterns, or cannot generalize beyond their training configurations. We present MetaFlow, which casts workflow generation as a meta-learning problem: given a task and an operator set, the model learns to compose solution strategies. MetaFlow trains in two stages: supervised fine-tuning on synthetic workflow data, followed by reinforcement learning with verifiable rewards (RLVR) that uses execution feedback across problem instances in the task to improve end-to-end success. The resulting model produces effective workflows for trained tasks and exhibits strong generalization to untrained tasks and novel operator sets. Across benchmarks in question answering, code generation, and mathematical reasoning, MetaFlow achieves performance comparable to state-of-the-art baselines on in-domain tasks with single inference, while demonstrating remarkable zero-shot generalization capabilities on out-of-domain tasks and operator sets.
[NLP-78] ViTL: Temporal Logic-Guided Zero-Shot Natural Language Navigation via Vision-Language Models
【速读】: 该论文旨在解决机器人在未知环境中遵循自然语言指令完成零样本、长时程、具有隐含时间与逻辑约束的多目标任务这一挑战。现有零样本物体导航方法虽利用视觉-语言模型(VLM)引导前沿探索,但仅适用于单目标任务,无法处理如“清理椅子或沙发,然后打开电视”这类需按特定时序访问多个目标的复杂指令。其核心解决方案在于从两个层面进行创新:在任务层面,采用大语言模型(LLM)将自然语言命令解析为线性时序逻辑(LTL)公式,并转换为确定性有限自动机(DFA),以协调多通道价值图并实现在新物体出现时的动态重规划;在导航层面,提出方向性评分机制,不再生成全局无方向的价值分布,而是对观测图像中的前沿方向进行标注,并从VLM中提取各方向的独立得分,从而提升导航精度与效率。实验结果表明,所提出的ViTL框架可在Habitat-Matterport 3D(HM3D)数据集上实现对带时间约束的自然语言导航任务的零样本长时程完成,且方向性评分显著优于基线方法。
链接: https://arxiv.org/abs/2606.30696
作者: Kaier Liang,Hengde Dai,Cristian-Ioan Vasile
机构: Lehigh University (莱赫大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Enabling robots to follow natural language commands to complete zero-shot long-horizon tasks remains challenging. It requires extracting implicit temporal and logical constraints from natural language commands and executing multiple sub-tasks accordingly. Recent zero-shot object navigation methods use vision-language models (VLMs) to guide frontier-based exploration in unknown environments, but they are limited to single-target tasks. Real-world commands such as “Clean either the chair or the couch, then turn on the tv.” require navigating to multiple targets in a temporally constrained order, which no existing zero-shot system can handle. We present ViTL, a framework that addresses this gap at two levels. At the task level, we use a large language model (LLM) to compile natural language commands into Linear Temporal Logic (LTL) formulas, which are then converted into Deterministic Finite Automata~(DFA) that coordinate multi-channel value maps and trigger dynamic replanning when new objects are detected. At the navigation level, we introduce directional score: rather than producing a direction-agnostic value across the entire field of view, we label frontier directions on the observation image and extract per-direction scores from the VLM. Experiments on Habitat-Matterport 3D (HM3D) show that the full framework enables zero-shot long-horizon completion of natural language navigation tasks with temporal constraints, and that directional score improves single-target navigation accuracy and efficiency over the baseline.
[NLP-79] ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection
【速读】: 该论文旨在解决现有基于语音的阿尔茨海默病(Alzheimer’s Disease, AD)检测系统依赖语音识别(ASR)导致时序结构丢失、过度依赖单一语言数据集且易受录音伪影影响的问题。其核心解决方案是提出一种不依赖ASR的框架,直接在梅尔频谱图(Mel spectrogram)上操作,并通过提取连续频谱帧间的谱时位移场(spectrotemporal displacement fields),捕捉随认知衰退变化的谱能量动态模式,作为数字生物标志物。该方法融合卷积神经网络-卷积门控循环单元(CNN-ConvGRU)生成的声学嵌入与上述位移特征,采用可学习的交叉注意力机制进行多模态融合,并通过带有可学习查询池化的Transformer编码器聚合信息;同时引入复合时间损失以增强片段间平滑性与对比一致性。实验表明,该框架在斯洛伐克、西班牙语语料库中分别达到83.9%和显著优于基线的准确率,而英语基线模型仅53.2%,验证了原有数据集中的已知伪影问题。跨语言消融分析揭示:当多模态信号分布均衡时,融合机制至关重要;当某一模态主导时,融合反而有害;而在无有效信号情况下则无意义。此外,辅助的时间损失收敛至语言无关值,表明模型架构具备跨语言稳定性。
链接: https://arxiv.org/abs/2606.30646
作者: Chukwuemeka Ugwu,Oluwafemi Richard Oyeleke
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia detection systems depend on transcription, discard within-recording temporal structure, and are validated on a single English corpus with known recording artifacts. We propose an ASR-agnostic framework operating directly on Mel spectrograms. Our key contribution is extracting spectrotemporal displacement fields from consecutive spectrogram frames, capturing shifting spectral energy patterns as digital biomarkers of cognitive decline. These features are fused with CNN-ConvGRU acoustic embeddings via a learned cross-attention mechanism and aggregated using a Transformer encoder with learnable query pooling. A composite temporal loss enforces smoothness and contrastive coherence across segments. We train independent models on English DementiaBank, Slovak EWA-DB, and Spanish Ivanova corpora, using clinical elicitation protocols taxing IADL-relevant cognitive domains. The Slovak model achieves 83.9% accuracy, and Spanish achieves, while the English baseline yields 53.2%, confirming known artifacts. Cross-lingual ablation studies reveal distinct fusion regimes: removing cross-attention collapses Spanish performance to 53.7%, below unimodal models, while the Slovak audio encoder alone outperforms the full model, 93.7% vs. 83.9%, and all English configurations remain near chance. Thus, multimodal fusion’s value is corpus-dependent: essential when signal is distributed across modalities, counterproductive when one dominates, and irrelevant when no signal exists. Auxiliary temporal losses converge to language-invariant values, indicating cross-lingual architectural stability.
信息检索
[IR-0] GR2 Technical Report
链接: https://arxiv.org/abs/2606.31984
作者: Yufei Li,Zaiwei Zhang,Mingfu Liang,Kavosh Asadi,Jay Xu,Jimmy Kim,Chongyang Bai,Jieyi Zhang,Hongye Xie,Prachi Agrawal,Dian Yu,Tianyi Chen,Jean-Pascal Billaud,Garret Buell, YK (Yongkang)Zhu,Sachin Patil,Brooke Bian,Zhou Fang,Kevin Huang,Shiva Sudanagunta,Yuzhen Huang,Emma Lu,Chris O’Brien,Yang Song,Lihong Li,Jacob Tao,Zhicheng Zhu,Chao Li,Gaoxiang Liu,Neil Wu,Zhongyin Hu,Li Han,Loki Chen,Ming Lei,Greg Rehm,Siyuan Song,Tianwei Zhang,Li Li,Ketan Singh,Yavuz Yetim,Ilyas Atishev,Satendra Gera,Ashkan Sadeghi,Rachel Yan,Nikko Mizutani,Shuaiwen Wang,Song Yang,Zhijing Li,Jiang Liu,Mengying Sun,Fei Tian,Xiaohan Wei,Chonglin Sun,Parish Aggarwal,Kaushik Rangadurai,Zhi Hua,Frank Shyu,Ruchit Sharma,Liyuan Li,Shike Mei,Wenlin Chen,Santanu Kolay,Ben Schulte,Deepak Chandra,Adam(Yang)Song,Sandeep Pandey,Xi Liu,Hamed Firooz,Luke Simon
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 18 pages, 10 figures
Abstract:Industrial recommendation systems serve billions of users through a multi-stage funnel – retrieval, early-stage ranking, and re-ranking – where the final re-ranking step disproportionately shapes user engagement and downstream performance, particularly for carousel and grid display formats. Despite growing enthusiasm for Large Language Models (LLMs) in recommendation, three gaps hinder industrial adoption: (1) most efforts target retrieval and ranking, leaving re-ranking – the stage closest to the final user experience – largely underexplored; (2) LLMs are typically deployed zero-shot or via supervised fine-tuning, underutilizing the reasoning capabilities unlocked by reinforcement learning (RL) on verifiable rewards; (3) deployed catalogs index billions of items with non-semantic identifiers that lie outside any base-LLM vocabulary. We present GR2 (Generative Reasoning Re-Ranker), an end-to-end framework that combines (i) mid-training on semantic IDs produced by a tokenizer with =99% uniqueness, (ii) reasoning-trace distilled from a stronger teacher via targeted prompting and rejection sampling, and (iii) RL with verifiable rewards purpose-built for re-ranking. To make GR2 resource-viable, we further (iv) introduce a context compressor that amortizes training cost, On-Policy Distillation (OPD) as a scalable alternative to SFT – which we find collapses at industrial scale – and reasoning distillation for low-latency serving. GR2 delivers +18.7% R@1, +7.1% R@3, and +9.6% N@3 over legacy baselines on industrial-scale traffic. We further find that reward design is critical in re-ranking: LLMs often hack rewards by preserving the incoming order or exploiting position bias, motivating conditional verifiable rewards as essential industrial components.
[IR-1] An Open-Source Tool for Reproducible Freeway Network Extraction from OpenStreetMap
链接: https://arxiv.org/abs/2606.31857
作者: Drew Miller,Cathy Wu
类目: Information Retrieval (cs.IR)
备注:
Abstract:Freeway simulation is often difficult to deploy at scale not only because of model formulation, but because preparing road network inputs remains a manual, corridor-specific, and difficult-to-reproduce task. This paper presents an open-source tool that extracts freeway networks from OpenStreetMap (OSM) and converts them into a compact, station-referenced representation suitable for downstream freeway simulation. Unlike existing tools that primarily support arterial or general network conversion tasks, the proposed workflow is designed around the specific requirements of freeway traffic studies. The tool supports not only OSM data cleaning and conversion, but also the broader workflow required in practice: corridor-specific querying, visual inspection of extracted segments, extraction validation against OSM, and source-data validation against aerial imagery. A locally hosted frontend allows users to define corridor-specific queries, select endpoints visually, and inspect extracted segments. The extraction logic is designed to address several recurring challenges in freeway OSM data, including inconsistent route references, ambiguous path selection through interchanges, managed-lane interference, incomplete corridor capture from naive bounding-box queries, and inconsistent ramp classifications. The workflow was first tested on two prototype corridors, where the extract-first-then-validate approach proposed here required roughly one-third the analyst effort of manual ramp encoding from scratch. It was then deployed across 359.6 miles of freeway in Orange County, California, with total processing and validation averaging about 41 seconds per mile. This deployment also suggests that, in a well-mapped region, OSM is sufficiently accurate for many freeway traffic studies. Overall, the tool provides a more scalable and reproducible foundation for freeway network preparation. Subjects: Information Retrieval (cs.IR) Cite as: arXiv:2606.31857 [cs.IR] (or arXiv:2606.31857v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2606.31857 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[IR-2] ShopX: A Foundation Model for Intent-to-Item Fulfillm ent in Agent ic Shopping
链接: https://arxiv.org/abs/2606.31693
作者: Jiacheng Chen,Tao Zhang,Manxi Lin,Dunxian Huang,Teng Shi,Honghao Fu,Mengyan Li,Xinming Zhang,Chenchi Zhang,Xuan Lu,Xiaoxiong Du,Haibin Chen,Shaolin Ye,Hao Chang,Xiaoqi Li,Shuwen Xiao,Yujin Yuan,Jingxuan Feng,Shaopan Xiong,Huimin Yi,Ju Huang,Qiu Shen,Ying Chen,Junjun Zheng,Xiangheng Kong,Yuning Jiang
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:The wave of AI-native applications is moving shopping beyond page- and feed-based browsing toward intent-driven experiences orchestrated by LLM agents. A common design wraps an LLM around existing search and recommendation pipelines, forcing complex intents through low-bandwidth retrieval or ranking interfaces and leaving a gap between language understanding and item-space fulfillment. Generative recommendation gives LLMs a direct item-space interface through semantic IDs (SIDs), but existing models mainly generate candidates for retrieval rather than translate flexible intents into item-space outcomes. We propose ShopX to address this bottleneck by unifying intent understanding, execution planning, and flexible SID-native item-space operations into a single foundation model. We deploy ShopX in agentic shopping workflows through a model-native item-fulfillment framework with a serving harness that defines a model-facing action protocol and exposes support surfaces for context access, catalog grounding, and state management. Within this framework, ShopX plans and composes SID-based item-space operations such as SID beam-search retrieval, listwise ranking, or product bundling. This model-centric design reduces lossy hand-offs between agent orchestration and item-space execution. To build ShopX, we design semantically recoverable, LLM-operable SIDs and a training recipe that equips a general LLM for flexible multi-turn item-space fulfillment while retaining the knowledge and instruction-following abilities needed by a shopping agent. We evaluate the ShopX framework against tool-mediated agentic systems on single- and multi-turn fulfillment tasks derived from anonymized Taobao production logs, showing that model-native fulfillment improves overall framework behavior, especially on complex or ambiguous requests.
[IR-3] Unsupervised Data-Efficient Cross-Modal Retrieval with Global-Neighborhood Alignment Hashing
链接: https://arxiv.org/abs/2606.31517
作者: Runhao Li,Xiaoxu Ma,Zhenyu Weng,Yue Zhang,Guibo Luo,Huiping Zhuang,Zhiping Lin,Yap-Peng Tan
类目: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Compared to supervised cross-modal hashing (CMH), unsupervised CMH reduces the reliance on manual labeling by learning binary codes from unlabeled image-text pairs. However, existing unsupervised CMH methods often rely on large-scale image-text pairs, which are costly to collect. To address this limitation, we propose Global-Neighborhood Alignment Hashing (GNAH), a novel approach that preserves the semantic structure of vision-language foundation models within a compact binary Hamming space using only a limited number of image-text pairs. Specifically, GNAH captures global structural information from the continuous latent space and transfers it into the binary Hamming space through a Prototype-Anchored Global Alignment module. In addition, GNAH extends conventional pairwise contrastive learning by modeling stochastic neighborhood relationships via a Contrastive Stochastic Neighborhood Alignment module, thereby alleviating overfitting to sparse pairwise correlations. Extensive experiments demonstrate that GNAH consistently outperforms existing unsupervised cross-modal retrieval methods under data-constrained settings, offering a practical solution for real-world CMH applications.
[IR-4] One Retrieval to Cover Them All: Co-occurrence-Aware Knowledge Base Reorganization for Session-Level RAG ACL2026
链接: https://arxiv.org/abs/2606.31156
作者: Shivam Ratnakar,Yixuan Zhu,Cecilia Cheng,Chaya Vijayakumar
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted to the Towards Knowledgeable Foundation Models (KnowFM) Workshop at ACL 2026
Abstract:RAG systems retrieve documents optimized for answering one query at a time. Yet enterprise users arrive with sessions, that is, coherent episodes of related questions that span semantically distant parts of the knowledge base. We show that a single retrieval call over a standard knowledge base covers only 41% of a user’s session-level information need. To close this gap, we reorganize the KB offline using co-occurrence-aware clustering and expand retrieval candidates through cluster neighborhoods at query time. On WixQA (6,221 enterprise support articles), our method raises single-query session coverage to 58% (+17% absolute; 95% CI: [14.1, 20.4]), reduces retrieval calls to 70% coverage by 34%, and compresses the KB to 20% of its original size, all consistently across four embedding models and six functional domains. We argue that session-level coverage, not single-query recall, should be the primary metric for enterprise RAG evaluation.
[IR-5] Usage frequency and application variety of research methods in library and information science: Continuous investigation from 1991 to 2021
链接: https://arxiv.org/abs/2606.31081
作者: Chengzhi Zhang,Liang Tian,Heting Chu
类目: Digital Libraries (cs.DL); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注:
Abstract:The present study analyzed over 26,000 research articles published between 1991 and 2021 in twenty-one major LIS (Library and Information Science) journals, using the machine learning (ML) approach to categorize the research methods used by LIS scholars. The findings of this study are significant. Firstly, there has been a shift in the research strategy from conceptual research (e.g., “Theoretical approach”) to empirical research (e.g., “Interview”) in LIS investigations over the past 31 years. Secondly, the research topics explored by LIS scholars during this period have moved from system-centered issues (e.g., “Information retrieval/models and algorithms”) to user-centered topics (e.g., "Information services "). Thirdly, the study revealed dynamic and revealing relationships between the 18 research topics identified in the study and the 16 research methods commonly adopted in the LIS field. These dynamic relationships can be visualized by year and longitudinally via an interactive map created in this study.
[IR-6] Building a Multimodal Dataset of Academic Paper for Keyword Extraction
链接: https://arxiv.org/abs/2606.31069
作者: Jingyu Zhang,Xinyi Yan,Yi Xiang,Yingyi Zhang,Chengzhi Zhang
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:
Abstract:Up to this point, keyword extraction task typically relies solely on textual data. Neglecting visual details and audio features from image and audio modalities leads to deficiencies in information richness and overlooks potential correlations, thereby constraining the model’s ability to learn representations of the data and the accuracy of model predictions. Furthermore, the currently available multimodal datasets for keyword extraction task are particularly scarce, further hindering the progress of research on multimodal keyword extraction task. Therefore, this study constructs a multimodal dataset of academic paper consisting of 1000 samples, with each sample containing paper text, images, audios and keywords. Based on unsupervised and supervised methods of keyword extraction, experiments are conducted using textual data from papers, as well as text extracted from images and audio. The aim is to investigate the differences in performance in keyword extraction task with respect to different modal information and the fusion of multimodal information. The experimental results indicate that text from different modalities exhibits distinct characteristics in the model. The concatenation of paper text, image text and audio text can effectively enhance the keyword extraction performance of academic papers.
[IR-7] Exploring the relationship between team institutional composition and novelty in academic papers based on fine-grained knowledge entities
链接: https://arxiv.org/abs/2606.31058
作者: Ziling Chen,Chengzhi Zhang,Heng Zhang,Yi Zhao,Chen Yang,Yang Yang
类目: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
备注:
Abstract:The composition of author teams is an important factor influencing the novelty of academic papers. However, existing studies have paid limited attention to the role of institutional composition, and most novelty measures remain at a general level, making it difficult to explain the specific sources and types of novelty in papers. Taking the field of natural language processing as an example, this study investigates the relationship between team institutional composition and the fine-grained novelty of academic papers. Author teams are classified into three types: academic institutions, industrial institutions, and mixed academic and industrial institutions. Four types of fine-grained knowledge entities are extracted from full-text papers, including methods, datasets, tools, and metrics. The novelty of papers is then measured based on entity combinations, and pairwise combinations of different entity types are further analyzed to examine their contributions to novel papers. The results show that, in the field of natural language processing, collaboration between industrial and academic institutions is more likely to produce novel papers than purely industrial collaboration. From the perspective of fine-grained knowledge entities, mixed academic and industrial teams pay more attention to the novelty of method-metric combinations, whereas industrial teams pay more attention to the novelty of method-tool combinations. This study reveals the relationship between institutional team composition and paper novelty through fine-grained novelty measurement, providing useful evidence for improving paper quality and promoting industry-academia-research collaboration.
[IR-8] GenPage: Towards End-to-End Generative Homepage Construction at Netflix
链接: https://arxiv.org/abs/2606.31031
作者: Lequn Wang,Jiangwei Pan,Fengdi Che,Linas Baltrunas
类目: Information Retrieval (cs.IR)
备注:
Abstract:We present GenPage, an end-to-end generative approach to Netflix homepage construction that replaces the traditional multi-stage recommender stack with a single transformer. GenPage treats the user and request context as a prompt, and autoregressively generates the entire structured, multi-row homepage as the response. We adapt the LLM training recipe: pretraining on production pages, followed by post-training via weighted binary classification (WBC) or reinforcement learning (RL). For industry-scale deployment, we introduce techniques addressing cold start, model freshness, business-rule enforcement, and serving efficiency. In online A/B tests against a mature, highly optimized production homepage recommender, the WBC variant of GenPage delivered a +0.24% lift on the core user engagement metric we use for launch decisions (p 0.001), while reducing end-to-end serving latency by 20%. Offline, two findings stand out: enriching the prompt yields a larger improvement than scaling model capacity in our current regime, and RL post-training increases homepage diversity even though diversity is not part of the objective.
[IR-9] owards Critical IR Theories and Practices
链接: https://arxiv.org/abs/2606.30984
作者: Bhaskar Mitra
类目: Information Retrieval (cs.IR); Computers and Society (cs.CY)
备注:
Abstract:Belkin and Robertson urged us, half a century ago, to develop a theoretical foundation for understanding what constitutes societal good that can inform information retrieval (IR) research and serve as a basis for determining when we should limit our scientific inquiry in the face of demands that are contradictory to societal good. In this article, I argue that to achieve this, IR should embrace critical theories and practices in our work, and shift away from the dominant liberal frame through which much of the IR community today view societal concerns in context of our research. Unlike the liberal frame, the critical frame explicitly adopts nondomination as its stated goal which can clarify our conceptualization of societal good within the field, provide necessary theoretical underpinning that Belkin and Robertson urged the community to develop, and serve as a basis for critical appraisals of our progress in enacting desired societal change.
[IR-10] Information Terra: A Narrative-Anchored Semantic-First Projection of Document Embeddings IEEE-VIS2026
链接: https://arxiv.org/abs/2606.30824
作者: Brian Keith-Norambuena,Fausto German,Chris North
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 5 pages, 6 figures, accepted in IEEE VIS 2026 as a short paper
Abstract:We introduce Information Terra, a narrative-anchored semantic-first projection that places a document corpus on an Earth-like globe whose poles are two user-chosen endpoint documents and whose prime meridian is the great-circle geodesic between them on the embedding hypersphere – so latitude encodes narrative progress and longitude thematic deviation. Land features are recovered from document density via kernel density estimation and labeled by theme. A narrative trail built from the underlying narrative coherence graph, and constrained to be monotone in geodesic progress, provides a readable storyline. The projection’s axes are semantically grounded in the user’s chosen narrative endpoints, and the globe metaphor affords rotation and antipodal reading. We demonstrate the method on a 540-article Cuban Protests corpus, showing a storyline from Obama’s 2016 visit to the 2021 International Aid during the protests.
人机交互
[HC-0] Investigating LLM -Powered Dissenting Minority Support in Power-Imbalanced Group Decision-Making: Counterargument and Mediation as Intervention Strategies
链接: https://arxiv.org/abs/2606.31762
作者: Soohwan Lee,Seoyeong Hwang,Mingyu Kim,Dajung Kim,Kyungho Lee
类目: Human-Computer Interaction (cs.HC)
备注: Accepted at CSCW 2026
Abstract:Minority viewpoints are often suppressed in power-imbalanced group decision-making due to social pressure to comply with the majority. To address this problem, we developed an LLM-powered dissenting minority support system that aimed to foster attention to minority views through either AI-generated counterarguments or AI-mediated messages. We conducted a mixed-method experiment with 96 participants in 24 groups, comparing minority members’ experiences across baseline, AI-counterargument, and AI-mediated message conditions. Our findings revealed a nuanced trade-off: AI-generated counterarguments fostered a more flexible atmosphere and enhanced satisfaction, while AI-mediated messaging increased minority participation but unexpectedly reduced their psychological safety. This research contributes empirical evidence on how different AI implementations affect group dynamics, identifies a critical support paradox between participation and psychological safety, provides design implications for future systems, and highlights ethical challenges in implementing AI-mediated communication in hierarchical settings. These insights advance understanding of designing more equitable AI support for power-imbalanced group decision-making.
[HC-1] From Idea to Prototype in an Afternoon: Scaffolded AI-Assisted Rapid VA Prototyping
链接: https://arxiv.org/abs/2606.31311
作者: Gennady Andrienko,Natalia Andrienko
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Testing a new visual-analytics idea usually takes months: one needs to find a realistic data set, clean it, and implement an interactive prototype. We describe a case where a workflow language and an AI assistant reduced this effort to one afternoon. The idea under test: relax the Pareto frontier with a tolerance and group the surviving options into recurring types – constellations'' on a soft sky’'. Using the Artifact–Transform Workflow Language (ATWL) as a scaffold, we obtained a consistent workflow in minutes and a running prototype in a few hours. We derive three lessons. The scaffold matters: without ATWL the assistant produced a naive workflow. The scaffold alone is not enough: the first implementation was only average, and expert knowledge injection was needed to reach state-of-the-art quality. Finally, the way the scaffold is used matters: controlled experiments show that a language definition and a library of examples support different aspects of the task, that providing both at once reduces quality because template following displaces creative content, and that scaffolds work best when introduced after an initial unconstrained design pass. We argue that the field needs a typology of human knowledge injection, in a form that is both human-editable and machine-accessible.
[HC-2] AA: A Multi-view Multimodal Dataset for Screen-based Gaze Estimation
链接: https://arxiv.org/abs/2606.31211
作者: Chang Liu,Jiaqi Liu,Zhoutong Ye,Xinjie Shen,Chun Yu,Yuanchun Shi
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注:
Abstract:We present AA, a multi-view multimodal dataset for screen-based gaze estimation. The dataset captures synchronized facial observations from eight fixed screen-mounted cameras and two additional side-view cameras, paired with precise screen-space gaze targets collected under controlled fixation conditions. Each sample contains multi-view face observations together with structured facial region crops, enabling multimodal learning from both global and local visual cues. Unlike existing single-view gaze datasets, AA provides multi-view coverage from both screen-mounted and side-mounted perspectives, enabling more robust modeling under viewpoint variation and occlusion. The dataset includes subject-independent evaluation splits and a standardized data processing pipeline to support reproducible research in gaze estimation.
[HC-3] What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR INTERSPEECH2026
链接: https://arxiv.org/abs/2606.31112
作者: Hawau Olamide Toyin,Srinivasan Umesh,Hanan Aldarmaki
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 5 pages, 2 figures, accepted at Interspeech 2026
Abstract:ASR systems have been often reported to underperform on atypical speech. An often conflated compounding factor is the existence of two valid transcription references: verbatim (actual produced speech, including repetitions/prolongations) and intended (the canonical form of the text with disfluencies removed) in atypical speech recognition depending on context and use-case. Most ASR evaluations conflate this duality into a single ground truth and reward systems that delete disfluencies, ignoring verbatim faithfulness. We benchmark 11 ASR models from encoder-decoder, CTC and transducer families using both verbatim and intended references on atypical stuttered speech as a case study. Our quantitative assessment underlines the disparity in model performance and rankings using the two transcript styles. Through this analysis, we highlight the importance of selecting a suitable transcription reference for valid model selection depending on the use-case, particularly for atypical ASR.
[HC-4] Evaluating Interactivity: Toward Automated Assessment of AI-Generated Explorable Explanations
链接: https://arxiv.org/abs/2606.31012
作者: Xiaozao Wang,Zhewei Wang,Hongyi Wen
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:While large language models now enable rapid generation of interactive learning materials, evaluating the interaction quality of these explorable explanations remains an open challenge. Existing benchmarks largely focus on code executability or visual fidelity, providing limited insight into dynamic interaction behaviors such as learner-controlled state transitions and context-sensitive system responses, which are factors that critically shape learners’ conceptual understanding. We present EE-Eval, an automated evaluation framework that formalizes interactivity as a finite space of learner-controllable states and transitions, represented as a Finite State Machine (FSM). By extracting FSMs from AI-generated explorable explanations, EE-Eval externalizes implicit interaction logic into an explicit, machine-interpretable graph. Evaluation is performed by comparing each generated FSM to an ideal FSM that encodes pedagogical intent, using a combination of graph-based metrics and embedding-based comparison of states, actions, and feedback to measure their structural and semantic similarity. Across thousands of generated explorable explanations spanning 127 concepts and produced by 6 AI models, EE-Eval consistently differentiates interaction quality beyond surface-level criteria such as functional correctness or visual quality, and exhibits substantially stronger alignment with human judgments of interactivity and pedagogical effectiveness than existing baselines. By framing interactivity as testable behavioral models rather than an emergent byproduct of LLM generation, EE-Eval transforms evaluation into a reflective diagnostic tool, enabling pedagogically grounded and actionable human-AI collaboration in creating interactive educational content.
[HC-5] Ethics and Social Responsibility in AI-Assisted Interviewing: An LLM -in-the-Loop Study of AI-Generated Follow-Up Questions
链接: https://arxiv.org/abs/2606.30980
作者: He Zhang,Yueyan Liu,Xin Guan,Jie Cai,John M. Carroll
类目: Human-Computer Interaction (cs.HC)
备注: This work has been accepted to CHIWORK '26
Abstract:Semi-structured interviews rely on timely, context-sensitive follow-up questions, yet interviewers’ cognitive load and limited domain familiarity can constrain probing depth. We report findings from an LLM-in-the-loop Wizard-of-Oz (WoZ) study that simulates an AI follow-up assistant in live interviewing while preserving human oversight. In our setup, a co-interviewer selectively relayed and could edit AI-generated follow-up questions (AGQs) produced in real time by GPT-4o, enabling a realistic approximation of deployment without fully automating the interaction. Across 17 interviewers with varied qualitative-method expertise, participants raised five interlocking concerns: (1) harmful or discriminatory language and unpredictable interaction harms, (2) undermining interviewees’ sense of respect through divided attention and missing nonverbal cues, (3) technology-based participation inequality, (4) unclear responsibility when harms occur, and (5) privacy, disclosure, and compliance risks when AI listens, records, or transcribes sensitive content. We translate these concerns into design and governance implications for safer, more respectful, and more accountable AI-assisted interviewing.
[HC-6] Anthropomorphism in AI Companion Communities: Age Gender and Emotional Correlates
链接: https://arxiv.org/abs/2606.30942
作者: Afia Mubashir,Boden Moraski,Stephanie Choi,Rose E. Guingrich
类目: Human-Computer Interaction (cs.HC); Computers and Society (cs.CY)
备注:
Abstract:Artificial intelligence (AI) systems are increasingly integrated into daily life, with millions now using AI chatbots built on Large Language Models (LLMs) for companionship. Both humanlike AI qualities and user predispositions to anthropomorphize relate to social consequences, such as increased trust, social health benefits, and psychological harms. Populations such as children, older adults, or those with mental health vulnerabilities may be particularly susceptible to anthropomorphism and its detriments, but mixed findings complicate the role of demographics. We used publicly available Reddit data from three popular AI companion subreddits to assess relationships between gender, age, anthropomorphism, and elicited emotions, to better understand how different people perceive and are affected by AI companions. We investigated three questions: How do age and gender relate to anthropomorphization of AI?, How does emotional expression relate to anthropomorphization?, and How do age and gender moderate emotion-anthropomorphization relationships? We found that adults and women anthropomorphize AI chatbots more than teens and men, and that positive emotional expression, particularly joy, is positively associated with anthropomorphization, while neutrality is negatively associated with anthropomorphism. Both relationships were stronger in adults than teens. Our findings suggest that the tendency to anthropomorphize may be more broadly distributed across age groups than previously expected, thereby prompting the reevaluation of existing digital safety norms.
[HC-7] Debugging as Evidence-Driven Reasoning : Visualization Opportunities in Data-Intensive Programming IEEE-VIS
链接: https://arxiv.org/abs/2606.30884
作者: Yongbo Chen,Yan Zhu,Rebecca Faust
类目: Human-Computer Interaction (cs.HC)
备注: 5 pages, 1 figure, submitted to IEEE VIS conference
Abstract:Visualization has been recognized as a valuable means of supporting debugging by externalizing runtime behavior that would otherwise remain hidden or scattered. However, most visual debugging research has focused on traditional software development settings, leaving the distinct challenges of data-intensive workflows largely uncharacterized. To build visual debugging support for these settings, we first need to characterize how practitioners debug in these settings and translate their challenges into concrete visualization opportunities. To this end, we conducted semi-structured interviews with nine participants from diverse data-intensive domains and analyzed the data using thematic analysis. Our analysis reveals three cross-cutting challenge: assembling fragmented evidence, detecting expected-observed discrepancies, and tracing state evolution across workflow components. We distill these challenges into three concrete requirements that current debuggers support only partially but that visualization is well suited to address: cross-artifact evidence alignment, expectation-grounded comparison, and traceable state evolution. Together, these requirements begin to characterize a design space for future visual debugging research in data-intensive programming.
[HC-8] Neural Signatures of Programming Expertise: Classifying Programmer Skill Levels Using EEG Data
链接: https://arxiv.org/abs/2606.30879
作者: Maurice Rekrut,Mahima Mahabaleshwar Acharya,Taisiia Ulianova,Norman Peitek,Annabelle Bergum,Mariya Toneva,Sven Apel,Antonio Krüger
类目: Human-Computer Interaction (cs.HC)
备注:
Abstract:Accurately assessing a programmer’s skill level is critical for hiring, team composition, and performance evaluation in the software industry. Conventional methods, such as coding tests or interviews, often fail to capture the full spectrum of cognitive abilities underlying programming expertise. This study explores using electroencephalography (EEG) and machine learning to investigate neural correlates of programming skill. We analyzed an existing EEG dataset recorded during code comprehension from 37 programmers with 1 to 30 years of experience (8.1 +/- 6.3 years) to examine relationships between neural activity and expertise. Additionally, we conducted classification experiments using Random Forest classifiers with diverse features for binary (experts vs. novices) and multi-class (experts, intermediates, novices) this http URL identified EEG features and brain regions associated with programming expertise. Specifically, EEG entropy showed the strongest correlation with skill level. Furthermore, experts’ brains were characterized by highly localized centro-frontal activation, whereas frontal activation in other groups was part of a more distributed network. Regarding classification, our setup achieved an average accuracy of 91.83% (binary) and 78.15% (multi-class) in stratified 10-fold cross-validation, while leave-one-subject-out validation achieved 85.00% and 58.80%, respectively. Individual frequency bands outperformed full-spectrum analyses, and both program comprehension and resting-state data yielded strong results. These findings demonstrate that EEG features effectively capture neural correlates across different skill levels and highlight the potential of neural data to complement traditional methods of skill assessment.
[HC-9] Drawing Out Legal Risks: Co-Designing with Lawyers to Predict and Manage Legal Uncertainties of Medical AI Tools
链接: https://arxiv.org/abs/2606.30828
作者: Gennie Mansi,Julia Kim,Michael Rosenbloom,Mark Riedl
类目: Human-Computer Interaction (cs.HC)
备注: 13 pages, 8 figures; Note: This paper is formatted as a submission type called a Pictorial, used in some ACM venues (e.g. this https URL ). Pictorials present visual components (e.g. study artifacts, diagrams) with text to convey contributions. We present co-created visualizations from our study alongside our analysis/. For more justification of the format, see page 4
Abstract:While there’s optimism around medical AI tools due to their abilities to adapt from user-to-user and across environments, these new abilities complicate how people and organizations are able to predict and manage risk based on existing laws and regulations. Lawyers are trained to identify potential legal outcomes, but they lack technical AI knowledge, making it difficult to translate their expertise to creators and users of AI tools. We contribute insights from our co-design process with U.S. lawyers to identify and translate ways to predict and manage risks of medical AI tools. We present the visualizations we developed through two years of cross-disciplinary efforts and thereby illustrate our findings about how legal risks are determined and our strategies for people and organizations to predict and manage these risks. We offer insights about leveraging lawyers’ expertise to understand, predict, and manage legal risks.
[HC-10] Improving Survey Participation in Low-Literacy Populations Through Value-Sensitive Conversational AI ECAI2026 IJCAI
链接: https://arxiv.org/abs/2606.30660
作者: Raj Gaurav Maurya
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted at the IJCAI-ECAI 2026 AI and Social Good Track
Abstract:Collecting reliable social data from low-literacy populations remains a persistent challenge, particularly when surveys involve sensitive topics and marginalized communities. Traditional paper-based and web-based survey modalities often suffer from high attrition and incomplete responses due to literacy barriers, social pressure, and interactional discomfort. In this paper, we present findings from an initial field evaluation comparing multiple survey modalities paper-based interviews, digital web-based surveys, conversational AI (convAI) surveys, and convAI enhanced with layered value-sensitive design conducted with low-literacy women across India. Using data from 315 participants, we show that convAI significantly improves survey completion rates relative to traditional modalities, with the highest completion and lowest drop-off observed when value-sensitive and culturally aligned conversational design elements are fully integrated. These results demonstrate the importance of human-centered and value-sensitive interaction design in enabling inclusive, ethical, and scalable data collection; motivating more `AI for social good’ applications.
计算机视觉
[CV-0] FaceMoE: Mixture of Experts for Low-Resolution Face Recognition ECCV2026
链接: https://arxiv.org/abs/2606.32040
作者: Kartik Narayan,Vishal M. Patel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026, Project Page: this https URL
Abstract:Low-resolution face recognition (LR-FR) remains a challenging task due to poor feature extraction and aggregation, as probe images often contain limited identity information resulting from extreme degradations such as blur, occlusion, and low contrast. Additionally, the domain gap between high-resolution (HR) gallery images and low-resolution (LR) probe images poses a significant challenge. A single feature encoder struggles to generalize effectively across both domains when fine-tuned on an LR dataset, and this issue is further magnified by catastrophic forgetting. To address these challenges, we propose FaceMoE, an effective adaptation of Mixture of Experts (MoE) transfomer architecture for low-resolution face-recognition . Specifically, we introduce multiple specialized feed-forward network (FFN) experts and incorporate a top-k router, which dynamically assigns tokens to appropriate experts. This design emergently promotes specialization across experts for different semantic regions of the face, which enables FaceMoE to perform resolution-aware feature extraction. Moreover, the top-k router facilitates sparse expert activation, enabling the model to preserve pretrained knowledge when finetuned on a LR dataset, while increasing model capacity without proportional computational overhead. FaceMoE is trained with a combined face recognition loss, router z-loss, and load balancing loss to ensure expert specialization and stable training. To the best of our knowledge, this is the first work leveraging MoE for LR-FR. Extensive experiments across eleven datasets, spanning HR, mixed-quality, and LR benchmarks, demonstrate that FaceMoE significantly outperforms state-of-the-art methods. Code: this https URL
[CV-1] GEAR: Guided End-to-End AutoRegression for Image Synthesis
链接: https://arxiv.org/abs/2606.32039
作者: Bin Lin,Zheyuan Liu,Chenguo Lin,Sixiang Chen,Yunyang Ge,Yunlong Lin,Jianwei Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Li Yuan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual generative models are typically trained in two stages. A tokenizer is first trained for reconstruction and then frozen, after which a generator is trained on its discrete indices or continuous latents. This decoupling leaves the tokenizer unaware of what the generator finds easy to model. We present GEAR (Guided End-to-end AutoRegression), which trains a vector-quantized (VQ) tokenizer and an autoregressive (AR) generator jointly and end-to-end, guided by representation alignment. The key obstacle is that the VQ index fed to the AR model is non-differentiable, so gradients cannot reach the tokenizer, and a straight-through estimator collapses. GEAR resolves this with a dual read-out of the codebook assignment. A hard, one-hot branch trains the AR with next-token prediction, while a differentiable soft branch carries a representation-alignment loss that flows back to guide only the tokenizer. The AR model thereby steers its tokenizer toward an index distribution it can predict more easily. This shifts the alignment burden from the tokenizer to the AR: the tokenizer’s own features become less DINOv2-like while the AR’s become more so, the opposite of diffusion-side recipes that make the latent itself semantic. GEAR speeds up ImageNet gFID convergence by up to 10x relative to the strong LlamaGen-REPA baseline, learns markedly better patch-level and spatially-coherent features, and generalizes across quantizers (VQVAE, LFQ, IBQ) and to text-to-image generation.
[CV-2] PointSplat: Compact Gaussian Splatting via Human-Centric Prediction
链接: https://arxiv.org/abs/2606.32036
作者: Yujie Guo,Yudong Jin,Lingteng Qiu,Zehong Shen,Zhen Xu,Jing Zhang,Xianchao Shen,Hujun Bao,Sida Peng,Xiaowei Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Producing 3D human representations from input views on the fly is essential for immersive live streaming systems, where representation compactness is as critical as high fidelity given limited computational power and transmission bandwidth. Although recent feed-forward reconstruction methods achieve impressive quality through the view-centric prediction of 3D representations, they repeatedly encode the same subject content across multiple views, leading to significant inter-view redundancy. Our key insight is to perform predictions directly in 3D space, enabling the network to learn and produce a highly compact representation. To this end, we propose PointSplat, a novel human-centric approach that directly infers Gaussian primitives from an input point set. The proposed method first estimates a coarse geometric proxy and performs ray casting to prune redundant points and establish explicit 2D–3D correspondences. Subsequently, it employs a Point-Image Transformer to fuse appearance and geometry features, predicting Gaussian attributes in a single forward pass. This design restricts predictions to foreground regions of interest, substantially reducing the total number of Gaussians while improving novel-view rendering quality. Extensive experiments demonstrate that PointSplat achieves higher efficiency and quality while exhibiting strong robustness to variations in view count and image resolution across multiple datasets.
[CV-3] SpheRoPE: Zero-Shot Optimization-Free 360 Panorama Generation with Spherical RoPE
链接: https://arxiv.org/abs/2606.32033
作者: Or Hirschorn,Aaron Olender,Eli Alshan,Ianir Ideses,Lior Fritz,Sagie Benaim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We present a zero-shot, training-free and optimization-free framework for generating 360 panoramic images and videos by directly injecting spherical priors into pre-trained diffusion transformers. Existing methods either rely on costly fine-tuning on scarce panoramic data that limits generalization, or leverage multi-step optimization that incurs prohibitive inference latency. We observe that contemporary generative models natively exhibit some panoramic priors from large-scale training. However, these emergent capabilities are insufficient, as the models fundamentally fail to satisfy the rigorous topological constraints imposed by equirectangular projection (ERP). We introduce a zero-shot and optimization-free approach that resolves these constraints at inference time. Spherical RoPE replaces standard rotary position embeddings: low-frequency channels are re-parameterized as 3D Cartesian coordinates to natively encode the spherical manifold, while high-frequency channels are harmonically quantized to enforce exact periodicity. Coupled with complementary Semantic Distortion classifier-free guidance (CFG) that explicitly steers geometry, we avoid retraining and inherit the full creative breadth of state-of-the-art models. Our approach generalizes across diverse backbones and 360 generation modalities. We demonstrate this across text-to-panorama using Flux.1, Flux.2, and LTX-Video backbones, achieving competitive performance against baselines, all while remaining training-free. Project page: this https URL
[CV-4] FLORA: A deep learning approach to predict forest attributes from heterogeneous LiDAR data
链接: https://arxiv.org/abs/2606.32023
作者: Emilie Vautier,Clément Mallet,Cédric Vega
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Forest attributes are essential for national-scale resource monitoring. Airborne LiDAR metrics are among the auxiliary variables most strongly correlated with forest attributes used in National Forest Inventory (NFI) estimates. However, producing wall-to-wall predictions remains challenging when LiDAR data are acquired under heterogeneous conditions. As national LiDAR programs expand across Europe, variability in sensors, flight parameters, seasons, and scan angles limits the robustness of existing models, which are often calibrated for local conditions. We present FLORA (Forest LiDAR Octree Regression with Auxiliary Data), a deep learning framework that predicts six forest attributes: dominant height, total volume, deciduous volume, coniferous volume, basal area, and stem density from heterogeneous LiDAR point clouds. FLORA combines an octree-based backbone with ecological and spatiotemporal auxiliary variables through a late-fusion gating mechanism. Models are trained and evaluated on 32,052 National Forest Inventory plots across mainland France using data from the French LiDAR HD program. A single model trained on both leaf-on and leaf-off acquisitions outperforms season-specific models and improves cross-season robustness. Auxiliary variables provide modest overall gains but contribute more strongly to species-specific volume prediction. FLORA achieves an rRMSE of about 12.3% (R2 = 0.88) for dominant height and 39% (R2 = 0.74) for total volume, providing a robust baseline for large-scale forest attribute estimation from heterogeneous national LiDAR programs.
[CV-5] Cross-Space Distillation: Teaching One-Step Students with Modern Diffusion Teachers ECCV2026
链接: https://arxiv.org/abs/2606.32020
作者: Anh Nguyen,Ngan Nguyen,Duc Vu,Trung Dao,Viet Nguyen,Quan Dao,Kien Nguyen,Chi Tran,Phong Nguyen,Khoi Nguyen,Cuong Pham,Dimitris Metaxas,Vishal M. Patel,Anh Tran
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Modern one-step diffusion models achieve impressive quality through distribution-based timestep distillation. Yet, they rely on a critical assumption: Teacher and Student must inhabit the same latent space. This Shared-Space constraint prevents knowledge transfer from modern high-capacity Teachers (e.g., SD 3.5 and Flux) into compact, deployment-friendly Students such as SD 1.5, whose latent resolution and VAE parameterization differ from the Teacher. We formalize this overlooked regime as Cross-Space Distillation, where Teacher and Student differ in both latent resolution and VAE space. To enable distillation under this mismatch, we introduce the Bridge, a lightweight latent interface that maps Student latents into the Teacher space without modifying the Student backbone. Bridge combines a frozen Student VAE decoder as a spatial prior with a compact learnable projector, and is trained with latent reconstruction and attention fidelity objectives for stable Teacher-space alignment. Across diverse modern Teachers, Bridge enables substantial gains for compact one-step Students; for example, it improves SD 1.5 from 5.4 to 9.4 HPSv3 while preserving one-step inference, low latency, and broad ecosystem compatibility. These results show that heterogeneous large Teachers can be distilled into efficient, deployable backbones through a lightweight latent-space interface.
[CV-6] Automated Background Swapping for Robustness against Spurious Backgrounds
链接: https://arxiv.org/abs/2606.32018
作者: Cesar Roder,Kajetan Schweighofer
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Classifiers based on Deep Neural Networks exhibit strong performance across domains, yet can fail catastrophically if they rely on spurious correlations, i.e., features that are predictive of the target label in the training data but are not causally linked and thus fail to generalize. For the vision domain, many such spurious correlations manifest themselves within the background of the image, where only the foreground is predictive of the class label. In this paper, we introduce Automated Background Swapping (AutoBackSwap) to reduce the reliance of classifiers on such spurious backgrounds. AutoBackSwap uses a secondary network to disentangle the foreground and background, followed by infilling to synthesize complete backgrounds, and finally combines different foregrounds and inpainted backgrounds to augment the training data. We find that patch-wise labeling of just a few hundred samples suffices to train the secondary network and automatically augment the full training dataset on challenging image classification tasks. In contrast to many previous methods, AutoBackSwap proves very effective even if there is not a single sample in the training data breaking the spurious correlation. Across a range of image classification tasks with spurious backgrounds, AutoBackSwap consistently outperforms prior methods.
[CV-7] CoMet: Context and Multiplicity Decomposition for Multimodal Uncertainty Estimation
链接: https://arxiv.org/abs/2606.32012
作者: Sanghyuk Chun,William Yang,Amaya Dharmasiri,Olga Russakovsky
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 33 pages, 13.3MB
Abstract:Uncertainty estimation has been a long-standing challenge in AI models; it amounts to “knowing what you don’t know,” and metacognition is notoriously difficult even for humans (cf. the Dunning-Kruger effect). Although it is still far from solved even in simpler classification systems, tackling it in multimodal large language models (MLLMs) is becoming increasingly important. Within MLLMs, uncertainty can stem from any of the diverse sources as well as from their relationships, and further can stem from the unbounded answers in the open-ended setting. To tackle the issues, we propose CoMet, an MLLM uncertainty estimation method by decomposing uncertainty into a context-specific term and a multiplicity-specific term. The former captures ambiguity induced by the given context (e.g., task or prompt), while the latter captures how many plausible answers determined by the context remain compatible with the given input. We train a lightweight post-hoc uncertainty module to estimate these quantities, which enables efficient uncertainty estimation without autoregressive answer generation or repeated sampling. Experiments on various open-ended multimodal benchmarks, hallucination detection, and multiple-choice visual question answering benchmarks show that CoMet consistently improves uncertainty estimation over existing baselines while remaining efficient in practice. Code is available at this https URL
[CV-8] CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts COLT ECCV2026
链接: https://arxiv.org/abs/2606.31986
作者: Lianyu Hu,Shengqian Qin,Zeqin Liao,Qing Guo,Liang Wan,Wei Feng,Yang Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV2026. Code is available at this https URL
Abstract:Chain-of-thought (CoT) reasoning has enabled multi-modal large language models (MLLMs) to tackle complex visual reasoning tasks by generating explicit intermediate reasoning steps in natural language. However, this text-based reasoning paradigm is inherently slow at inference time with even thousands of tokens and fundamentally constrained by the expressiveness of natural language. In this paper, we propose CoLT, (Chain of Latent Thoughts), a novel framework that teaches multi-modal models to reason through a chain of latent thought representations instead of verbose text tokens, which can perform thinking with as few as 3 steps. Naively forcing the model to think with latent states easily produces meaningless semantics and makes training unstable. To effectively regulate the latent reasoning process, we introduce a lightweight external decoder that provides step-level supervision for each latent reasoning step in two complementary directions: a forward mode that decodes latent thoughts into the textual reasoning of the next step, and a backward mode that aligns decoder hidden states with the model’s latent thoughts given preceding textual context. We further incorporate internal supervision that encourages coherent step-by-step latent transitions. The decoder and internal supervision are removed during inference to maintain high efficiency of latent reasoning. Extensive experiments on eight benchmarks demonstrate that CoLT not only outperforms existing latent reasoning methods such as CODI and SIM-CoT, but also surpasses latent visual reasoning approaches that rely on auxiliary images with costly annotation requirements. Compared to text CoT methods, CoLT can notably reduce the inference time by 10.1 \times and text decoding time by 22.6 \times . Code is released at this https URL.
[CV-9] ERA: Entropy-Guided Visual Token Pruning with Rectified Attention for Efficient MLLM s
链接: https://arxiv.org/abs/2606.31982
作者: Yuhao Wang,Mu Qiao,Haiwen Diao,Yunzhi Zhuge,Pingping Zhang,Xindong Zhang,Lei Zhang,Huchuan Lu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 7 figures
Abstract:Multimodal Large Language Models (MLLMs) incur prohibitive inference costs due to long visual token sequences. Training-free visual token reduction provides an efficient solution. However, existing methods distort attention distributions, giving rise to a phenomenon we term Attention Logit Collapse. To address this issue, we propose ERA, an Entropy-guided visual token pruning framework with Rectified Attention for efficient MLLMs. Specifically, ERA comprises three crucial components: Dual-view Entropy Pruning (DEP), Bias-aware Token Recycling (BTR), and Logit-preserving Attention Rectification (LAR). First, DEP identifies representative anchor tokens by jointly modeling visual diversity and head-wise saliency. BTR then recycles pruned tokens into their corresponding anchors while estimating a cluster-level logit bias. Building upon this, LAR injects the estimated bias into attention logits, effectively rectifying the collapse induced by token reduction. Together, these components preserve visual evidence even under aggressive compression, enabling robust performance across single-image, multi-image, and video settings on a wide range of MLLMs. Beyond delivering practical acceleration, ERA establishes logit-preserving visual token pruning as a principled framework for efficient MLLMs, unifying theoretical foundation, algorithmic design, and practical deployment. The code is at this https URL.
[CV-10] LUNA: Learning Universal 3D Human Animation Beyond Skinning ECCV2026
链接: https://arxiv.org/abs/2606.31981
作者: Peng Li,Rawal Khirodkar,Junxuan Li,Yuan Dong,Chen Cao,Yuan Liu,Wenhan Luo,Yike Guo,Shunsuke Saito
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ECCV 2026, Project page: this https URL
Abstract:Creating photorealistic, animatable 3D human avatars from monocular images still largely depends on Linear Blend Skinning (LBS) and parametric body models, which constrain expressivity and often introduce artifacts due to imperfect fitting. We propose LUNA, an LBS-free universal neural animation model that directly maps multiple 2D controls like images, keypoints, sketches, and unseen characters into 3D Gaussian deformations, bypassing explicit body fitting. At its core, a transformer-based motion regressor disentangles global rigid motion from fine-grained local dynamics to capture both coherent movement and subtle non-rigid effects. To resolve the inherent ambiguity of 2D-to-3D lifting while scaling beyond fitted datasets, we introduce hybrid supervision that distills soft structural priors from an LBS teacher and a loss that supports training on both limited fitted data and large in-the-wild unlabeled videos. Extensive experiments show LUNA achieves competitive visual fidelity compared to LBS-based approaches, while delivering realistic human motion and zero-shot cross-identity generalization across diverse driving modalities. To the best of our knowledge, LUNA is the first end-to-end 3D animatable model that supports implicit 2D driving.
[CV-11] Planar-SfM: Camera Pose Estimation via Homography Graph Embeddings
链接: https://arxiv.org/abs/2606.31979
作者: Gabi Pragier,Matan Karklinsky,David Ungarish,Avi Ben-Cohen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Structure from Motion (SfM) systems traditionally struggle with planar scenes, where standard epipolar geometry-based methods become degenerate. Rather than viewing planar surfaces as a limitation, we propose a unified framework that leverages them as a source of geometric constraints. Our key insight is that each planar surface visible across multiple views provides an independent estimate of relative camera poses through homography decomposition. By aggregating estimates from multiple planes or even from a single dominant plane we achieve robust pose recovery in scenarios where traditional methods fail. We introduce a novel graph-based approach that constructs a pose-graph from homography estimates and employs spectral embedding to identify and filter unreliable edges. Our method maps homography-based pose estimates onto the real line based on their geometric and visual consistency, enabling efficient extraction of a maximally consistent spanning tree for pose recovery. This approach naturally handles both highly planar scenes, such as indoor sports arenas, and general 3 D environments. We demonstrate superior performance on basketball court imagery where existing methods struggle, while matching or exceeding state-of-the-art results on unconstrained outdoor scenes from the IMC Phototourism benchmark.
[CV-12] AnyBokeh: Physics-Guided Any-to-Any Bokeh Editing with Optical Fingerprint Transfer
链接: https://arxiv.org/abs/2606.31959
作者: Xinyu Hou,Xiaoming Li,Zongsheng Yue,Chen Change Loy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Depth-of-field control is a fundamental tool in photography, yet post-capture bokeh editing from a single image remains challenging. A practical editor should handle images captured under arbitrary focus and aperture settings. Existing methods typically assume an all-in-focus input, or first recover an all-in-focus image before rendering new bokeh. Such pipelines can discard useful blur cues from the source image and propagate reconstruction artifacts into the final edit. We introduce AnyBokeh, a physics-guided framework for any-to-any bokeh editing. Instead of treating source blur merely as a degradation to be removed, AnyBokeh estimates the source blur state with a signed circle-of-confusion map and a disparity map. By modeling the linear relation between signed circle of confusion and disparity difference, AnyBokeh estimates a source-specific optical fingerprint and transfers the source optical characteristics to the desired focus and aperture setting. A generative editor conditioned on both source and target circle-of-confusion maps then performs relative blur synthesis, enabling spatially adaptive deblurring, preservation, and defocus rendering. To support physically supervised learning, we further construct a high-fidelity synthetic dataset with accurate depth, focus distance, and full EXIF metadata. Experiments on real-world benchmarks show that AnyBokeh achieves faithful and controllable editing across any-to-any bokeh editing, all-in-focus-to-bokeh rendering, and defocus deblurring, while avoiding all-in-focus reconstruction and test-time bokeh-level calibration commonly required by existing approaches. The code and dataset will be available at this https URL.
[CV-13] DEMUN: Fast and accurate discovery of music notation in very large collections
链接: https://arxiv.org/abs/2606.31956
作者: Vojtěch Dvořák,Filip Bím,Jiří Mayer,Martina Dvořáková,Markéta Herzanová Vlková,Pavel Pecina,Petr Žabička,Jan Hajič jr
类目: Digital Libraries (cs.DL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Much of written musical heritage is preserved and digitised at memory institutions: libraries, museums, and archives. Owing to their collection structures, sheet music tends to be concentrated in large subsets that are defined as collections of music, with corresponding metadata that makes the music findable. However, when studying musical life as opposed to individual works, relevant documents often lie outside of these specialised collections: in textbooks, newspapers, other periodicals, pamphlets, and other documents with extensive circulation. But these documents are typically not catalogued as musical documents, and though there may be a lot of such documents overall, in large library collections, they are still extremely sparse. Manual discovery is thus unfeasible. Automated discovery requires an extremely low false positive rate in order to be useful, and must also operate quickly. We present DEMUN: a two-stage lightweight detector of music notation with a false positive rate of 0.015 %. In the test scenario, 4 million images of a national-scale library were processed, out of which 1,500 pages with music notation were discovered, suggesting the entire collection may contain up to 20-30,000 unmarked documents of musical life.
[CV-14] World Narrative Model for Highly Controllable Video Generation: A Paradigm Shift from Pixel Sampling to Physical World Orchestration
链接: https://arxiv.org/abs/2606.31946
作者: Ye Chen,Xuanhong Chen,Yupeng Zhu,Liming Tan,Zhewen Wan,Yuxuan Xiong,Tielong Wang,Jinfan Liu,Wuze Zhang,Xiongzhen Zhang,Feifei Li,Xianglin Luo,Zhehan Zhao,Zhifan Zhang,Laisheng Kou,Zhujing Liang,Yugang Chen,Muchun Chen,Xu Miao,Yijing Zhang,Xiaojie Sheng,Qiang Hu,Jialiang Chen,Weimin Zhang,Wenjun Zhang,Bingbing Ni
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The fundamental obstacle to industrial grade video generation is the lack of controllability: existing models treat video as a pixel distribution sampling problem, bypassing the explicit, instance level 4D (3D + T) physical world. Consequently, content creators cannot specify geometry, motion, camera parameters, or lighting in a deterministic, quantitative way, leading to the infamous ‘‘gacha’’ loop that makes professional content creation prohibitively inefficient and expensive. To address this, we introduce the World Narrative Model (WNM), a paradigm that decouples what to render – the structured physical narrative – from how to render – the pixel generation process. WNM replaces end-to-end black-box sampling with orchestrated 4D pre-visualization for media generation. Collaborative agents translate sparse multimodal inputs, including text, reference videos, and sketches, into a fully editable world representation with scene geometry, object layouts, character/animal skeleton motion, trajectories, camera motion, and lighting at quantitative, physically meaningful granularity. This representation acts as a deterministic structural blueprint that drives existing video foundation models, either frozen or lightly adapted, to render final footage, turning the base model into a faithful neural shader. Built on this engine, our human-AI platform supports automatic world generation and pre-visualization aligned with professional filmmaking pipelines, while director consoles enable seamless human refinement. Experiments show that WNM greatly reduces probabilistic ``gacha’’ calls and produces videos whose layout, motion, and cinematography closely follow creator intent. The framework is open and modular, allowing each component, such as world representation, control agents, and adapters, to be independently improved. Project website: this https URL.
[CV-15] FlexViT: A Flexible FPGA-based Accelerator for Edge Vision Transformers
链接: https://arxiv.org/abs/2606.31938
作者: Hubert Dymarkowski,Xingjian Fu,Rappy Saha,Jude Haris,José Cano
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
备注: Accepted to 36th International Conference on Field-Programmable Logic and Applications (FPL) 2026
Abstract:Deploying Vision Transformer (ViT) models on edge platforms remains challenging due to their high computational demands and the architectural heterogeneity of modern hybrid ViT models, which incorporate both fully connected and convolutional layers. This heterogeneity leads to significant variation in tensor shapes, requiring flexible and efficient FPGA-based acceleration. In this paper, we present FlexViT, a reconfigurable FPGA accelerator for efficient ViT inference on resource-constrained edge devices. Built on the SECDA-TFLite framework, FlexViT employs a hardware-software co-design approach that maps both fully connected and convolutional layers onto a unified high-throughput INT8 GEMM engine using a runtime im2col transformation. To efficiently support diverse layer configurations, we propose a dual-mode dataflow that dynamically switches between input and weight reuse by reconfiguring the compute array at runtime. We further introduce a depth-first tiling strategy that completes accumulation in a single pass, eliminating off-chip partial-sum transfers and reducing memory bandwidth requirements. We implement FlexViT on a PYNQ-Z2 FPGA and evaluate it across a representative set of ViT models. FlexViT achieves up to 2.74x speedup on accelerator-executed layers, translating into up to 1.40x end-to-end speedup compared to CPU-only execution. The code is available at: this https URL
[CV-16] No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs ECCV2026
链接: https://arxiv.org/abs/2606.31933
作者: Haojian Huang,Harold Haodong Chen,Meng Luo,Junjia Du,Shanqing Xu,Ziheng Chen,Yanxiang Huang,Yinchuan Li,Ying-Cong Chen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:We introduce VidPair-Halluc, a new benchmark for evaluating video hallucination in large video models (LVMs) under rigorous and controlled conditions. Unlike previous benchmarks that primarily rely on text-based perturbations or adversarial questions while neglecting the consistency of visual backgrounds, VidPair-Halluc features video pairs with highly similar backgrounds but distinctly different foreground semantics, enabling precise attribution of model errors to genuine hallucination rather than background variation. The benchmark is constructed through PairFlow, a pipeline that leverages recent advances in text-to-image and video generation to systematically compose stories, generate coherent video clips, and assemble them into adversarial pairs. Covering both spatial and temporal reasoning across ten semantic aspects, VidPair-Halluc comprises 1K high-quality adversarial video pairs and 11K spatio-temporal QA pairs with control over background and foreground variations. Evaluations on mainstream LVMs show persistent difficulty with robust fine-grained video understanding in adversarial settings, and code and data are available at the this https URL.
[CV-17] InstanceControl: Controllable Complex Image Generation without Instance Labeling
链接: https://arxiv.org/abs/2606.31924
作者: Xiaoyu Liu,Huan Wang,Fan Li,Zhixin Wang,Jiaqi Xu,Ming Liu,Wangmeng Zuo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Controllable image generation methods, such as ControlNet, have demonstrated a remarkable capacity to introduce visual conditions(e.g., depth maps) to guide image generation. However, these methods often struggle with complex multi-instance scenes, frequently leading to attribute confusion among instances. While recent approaches attempt to mitigate this via manual instance labeling, such requirements are labor-intensive. In this paper, we propose InstanceControl, a novel multi-instance controllable generation method that eliminates the need for instance labeling. We identify the primary bottleneck in existing methods as the inability to accurately associate instance descriptions with their corresponding regions within visual conditions. To address this, we leverage the Vision-Language Model (VLM) to establish instance-level correspondences between text prompts and visual conditions. Specifically, the VLM automatically parses instance descriptions from the text prompts and simultaneously predicts instance masks based on the visual conditions. Furthermore, since the predicted masks may contain noise, we introduce an adaptive mask refinement strategy that dynamically refines these instance masks during the generation process. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, achieving superior fidelity and precise instance-level control.
[CV-18] MVP-Nav: Multi-layer Value Map Planner Navigator
链接: https://arxiv.org/abs/2606.31919
作者: Wenyuan Xie,Shaokai Wu,Yijin Zhou,Yanbiao Ji,Guodong Zhang,Bayram Bayramli,Qiuchang Li,Xunchu Zhou,Yue Ding,Hongtao Lu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Zero-shot Object Goal Navigation (ZSON) with RGB-only perception poses a fundamental challenge for embodied agents, as the absence of explicit depth information introduces severe physical uncertainty and semantic-physical misalignment. Existing approaches either rely on high-level semantic reasoning without geometric grounding or learn end-to-end policies that lack explicit physical constraints, often resulting in semantically plausible but physically unsafe behaviors. In this paper, we propose MVP-Nav, a physical-aware RGB-only navigation framework that aligns perception, planning, and control with the real 3D world. MVP-Nav reconstructs explicit physical occupancy from monocular observations by leveraging 3D foundation models to project 2D semantic instances into 3D oriented bounding boxes, forming a global spatial semantic representation. To unify high-level semantic reasoning and low-level physical constraints, we introduce a Multi-layer Value Map (MVM) that integrates semantic priorities and reconstructed geometry into a shared cost space, enabling physically grounded geometric planning. Extensive experiments on zero-shot object navigation benchmarks demonstrate that MVP-Nav significantly outperforms existing depth-free methods, achieving state-of-the-art performance and validating that structured physical priors can effectively compensate for the absence of active depth sensors.
[CV-19] DriveWeaver: Point-Conditioned Video Inpainting for Controllable Vehicle Insertion in Autonomous Driving Simulation ECCV2026
链接: https://arxiv.org/abs/2606.31918
作者: Junzhe Jiang,Zipei Ma,Zijie Pan,Li Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026, Project Page: this https URL
Abstract:A pivotal step in autonomous driving simulation involves inserting foreground vehicles with predefined trajectories into simulated scenes. This process enhances scene diversity and facilitates the creation of various corner cases for testing and improving autonomous driving models. However, existing methods often rely on pre-reconstructed 3D assets, which frequently lead to lighting inconsistencies between the inserted foreground and the background. Moreover, the reliance on limited, manually-curated 3D assets hinders large-scale deployment. To address these challenges, we propose DriveWeaver, a novel framework for controllable vehicle insertion in autonomous driving simulation. Specifically, for a masked target insertion area, DriveWeaver performs video inpainting conditioned on vehicle point clouds to generate high-quality, temporally consistent vehicles. This video-inpainting-based approach ensures seamless blending between the foreground and background, while the readily available point cloud conditions enable superior generalization. To support long-term generation, we further design a global-to-local hierarchical inpainting strategy, ensuring the consistent identity and appearance of the inserted vehicles. Meanwhile, we extract explicit 3D Gaussian representations of the inserted vehicles through an urban reconstruction pipeline to enable real-time rendering for autonomous driving simulation. Extensive experiments across diverse datasets demonstrate that our method outperforms existing baselines in visual realism and geometric consistency, providing a robust tool for scalable autonomous driving scene augmentation.
[CV-20] Attend Transform or Silence: Operator-Level Visual Skipping for Efficient Multimodal LLM Inference
链接: https://arxiv.org/abs/2606.31903
作者: Zhaoyang Luo,Runmin Dong,Miao Yang,Fan Wei,Yushan Lai,Bin Luo,Haohuan Fu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal large language models (MLLMs) increasingly process long visual-token sequences, increasing the overall inference computation. Existing acceleration methods usually remove visual tokens or skip visual-token updates in entire layers, but these coarse strategies may discard fine-grained evidence or suppress useful operators together with redundant ones. In this paper, we study visual-token computation from an answer-observable perspective and find that late visual-token updates can remain large while having little effect on answer-token representations. Motivated by this answer-silent redundancy, we decompose each Transformer layer into attention and FFN operators and show that useful visual computation is often operator-dominant and layer-dependent. We propose an operator-level visual-token skipping framework that preserves the full visual-token sequence while selectively bypassing redundant attention, FFN, or both. Experiments across three MLLM architectures and 10 VQA benchmarks show that our method achieves strong efficiency-accuracy trade-offs, reducing \textbf33.7% TFLOPs on Qwen3-VL while retaining \textbf99.5% of the vanilla model performance.
[CV-21] RESOLVE: A Multi-Resolution and Multi-Modal Dataset for Roadside Cooperative Perception ECCV2026
链接: https://arxiv.org/abs/2606.31895
作者: Shaozu Ding,Linan Song,Marco De Vincenzi,Dajiang Suo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. Including supplementary material
Abstract:LiDAR has increasingly been integrated into traffic cameras to expand coverage and mitigate occlusion in roadside cooperative perception. However, how unimodal and camera-LiDAR fusion architectures behave under variations in LiDAR point sparsity induced by sensor configurations and scene-dependent sensing conditions remains underexplored. We introduce RESOLVE, a large-scale real-world benchmark dataset featuring multi-resolution roadside LiDAR and synchronized camera-LiDAR sensing for systematic evaluation of unimodal and fusion-based architectures in roadside 3D detection and tracking. RESOLVE contains over 100k images and 26k point cloud frames with 220k manually annotated bounding boxes, captured at a real-world urban intersection across diverse lighting and weather conditions and spanning 10 classes of traffic participants. In particular, RESOLVE enables controlled evaluation across three LiDAR resolution levels while keeping all other sensing and environmental factors fixed. This allows fair cross-architecture comparisons under point cloud distribution shifts resulting from resolution variations, sensing distance, and training-inference resolution mismatches. Results from extensive benchmark experiments reveal insights into how multimodal fusion can compensate for LiDAR point sparsity, offering clues for designing cost-efficient roadside multimodal perception. The dataset and benchmark codes are available at this https URL.
[CV-22] Harnessing Textual Refusal Directions for Multimodal Safety
链接: https://arxiv.org/abs/2606.31876
作者: Moreno D’Incà,Massimiliano Mancini,Nicu Sebe
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint
Abstract:To improve safety in Large Language Models (LLMs) we can either perform post-training alignment or exploit refusal directions in the activation space. Both strategies are less feasible in Multimodal LLMs (MLLMs) as they require unsafe multimodal data, harder to collect than their unimodal counterpart. In this work, we relax this constraint and investigate whether textual refusal directions, extracted directly from the LLM backbone, generalize across modalities (i.e., image, video). Preliminary findings confirm this ability, though effectiveness is conditioned by layer selection, steering strength, and cross-modal alignment, with the latter causing safe multimodal inputs to be spuriously steered toward refusal. Building on this, we introduce Modality-Agnostic Refusal Steering (MARS), a light-weight training-free approach that injects multimodal safety without the need for multimodal safety data. MARS corrects modality misalignment via activation re-centering, adaptively scales steering strength within a geometrically defined trust region, and selects the optimal intervention layer, operating at the first generated token. Evaluated on five SOTA MLLMs across safety, utility, and video jailbreak benchmarks, MARS achieves consistent safety gains while preserving utility. These results reveal that safety-relevant structure is shared across modalities and that textual refusal directions are a powerful and underexplored foundation for multimodal alignment.
[CV-23] SENSE-VAD: Sentient and Semantic Video Anomaly Detection for Autonomous Driving
链接: https://arxiv.org/abs/2606.31875
作者: Nghia T. Nguyen,Lokman Bekit,Yasin Yilmaz
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous vehicles (AVs) must navigate not only motion-based hazards but also socially complex situations whose danger is constituted by inter-agent relationships rather than movement statistics alone. A child running away from a guardian, a person being carried by another, or a pursuer chasing a pedestrian across a sidewalk are all anomalous in social context, yet none produces an obvious motion signal that current anomaly detectors are equipped to flag. We introduce SENSE-VAD, the first synthetic video anomaly detection benchmark for autonomous driving explicitly designed around socially complex anomalies. Using the CARLA simulator and Unreal Engine (UE), we generate distinct anomaly scenarios across multiple categories: individual behaviors, group behaviors, person–object interactions, cyclist interactions, vehicle agent, each annotated with per-frame binary labels. A key design principle is the separation of social anomaly from motion-based or appearance-based anomaly: many scenarios involve motion of objects that appears unremarkable in isolation but is anomalous in relational context. We additionally provide real-world normal and anomalous videos as a sim-to-real transfer probe. We evaluate state-of-the-art video anomaly detection baselines and demonstrate that socially complex anomalies constitute a distinct and currently unsolved challenge. Our dataset, annotations, and generation code are publicly available.
[CV-24] owards Voxel Spacing Consistency for Medical Image Segmentation
链接: https://arxiv.org/abs/2606.31839
作者: Xin You,Runze Yang,Minghui Zhang,Hanxiao Zhang,Han Li,Yi Yu,Jie Yang,Nassir Navab,Yun Gu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages, 6 figures
Abstract:Volumetric medical image segmentation is essential for both preoperative diagnosis and intraoperative guidance. While recent years have witnessed rapid progress in segmentation architectures, comparatively little attention is paid to the physical voxel spacing of anatomical data. Indeed, volumetric image resampling is a ubiquitous preprocessing step before segmentation, yet its interaction with downstream segmentation has not been systematically exploited. In this work, we study the correlation between image resampling and segmentation, and propose Consispace, a semantic-aware resampling framework that achieves consistent voxel spacing in the axial direction while preserving anatomical and semantic consistency. Consispace introduces an ODE-based anatomical constraint to model inter-slice dynamics with a continuous interpolator, enabling faithful reconstruction under complex anatomical transitions beyond discrete interpolation. To further couple resampling with segmentation objectives, we leverage dense features from a pretrained vision model to build intra-slice semantic correlation maps and inject class-wise semantic consistency via feature reweighting during resampling. Both intra-slice and inter-slice constraints are integrated into an implicit neural network, supporting arbitrary-scale resampling. Extensive experiments on multiple datasets demonstrate that Consispace achieves superior reconstruction quality and perceptual fidelity, produces smoother inter-slice anatomy, and improves downstream segmentation performance when used as a preprocessing step.
[CV-25] Real-Time Source-Free Object Detection ECCV2026
链接: https://arxiv.org/abs/2606.31834
作者: Sairam VCR,Varun Gopal,Poornima Jain,Vineeth N Balasubramanian,Muhammad Haris Khan
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ECCV 2026
Abstract:Real-world detectors for autonomous driving, surveillance, and robotics must handle domain-shifts under strict latency and memory constraints, yet existing source-free object detection (SFOD) methods rely on heavyweight architectures that prioritize accuracy alone. We show this trade-off is unnecessary: building on YOLOv10, an NMS-free dual-head detector, we achieve state-of-the-art adaptation accuracy while being faster and more compact. We observe that directly applying vanilla mean-teacher self-training to dual-head detectors leads to suboptimal adaptation performance due to two key factors. First, simple pseudo-label generation strategies, such as using a single head or directly combining high-confidence predictions from both heads, yield suboptimal supervision under domain-shift. We propose DHF (Dual-Head Pseudo-Label Fusion) which selectively admits one-to-one (O2O) and one-to-many (O2M) head predictions, preserving precision and recovering missed objects. Second, we observe domain-shift collapses multi-scale feature discriminability. We propose the use of our MARD (Multi-scale Adaptive Representation Diversification) loss which mitigates this by enforcing detection-aware variance and covariance constraints on multi-scale feature maps. Both modules are training-time only, leaving inference unchanged. Across domain-shift benchmarks, our method, RT-SFOD yields 1.4 to 3.5% mAP gains, 1.3 \times higher throughput, with \sim 2 \times fewer parameters than prior state-of-the-art SFOD methods, thus advancing the Pareto frontier of the speed-accuracy-model size trade-off. We report main results with YOLOv10, and demonstrate generalizability with additional YOLO- and DETR-based dual-head detectors. Code is available here: this https URL
[CV-26] PriorEye: Geospatial Visual Priors for End-to-End Autonomous Driving ECCV2026
链接: https://arxiv.org/abs/2606.31830
作者: Kyuhwan Yeon,Benjamin Ramtoula,Daniele De Martini
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ECCV 2026
Abstract:Most end-to-end autonomous driving methods rely solely on instantaneous sensor observations, limiting them to reactive behavior without the anticipatory foresight human drivers employ through prior experience. We introduce geospatial visual priors, street-level visual context anchored to the intended driving route, providing visual-spatial foresight independent of real-time sensors. We propose a memory augmentation module featuring a dual-memory architecture and an adaptive memory gate, which can be easily integrated into existing end-to-end approaches. This design pairs a contextual memory for retrieved priors with a persistent fallback memory, and dynamically regulates the influence of memories based on current state compatibility. Evaluated on the NAVSIM-v2 benchmark, our approach consistently improves performance across diverse end-to-end baselines. Furthermore, because these priors are independent of onboard sensors, our method inherently improves robustness against sensor corruption, while the dual-memory design ensures safe fallback when the retrieved priors themselves become unreliable. Our project page is available at this https URL.
[CV-27] Breaking Failure Cascades: Step-Aware Reinforcement Learning for Medical Multimodal Reasoning
链接: https://arxiv.org/abs/2606.31825
作者: Junha Jung,Minbyul Jeong,Suhyeon Lim,Sungwook Jung,Jaehoon Yun,Taeyun Roh,Mujeen Sung,Jaewoo Kang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent multimodal large language models have shown great promise in clinical image reasoning, but existing post-training pipelines remain predominantly outcome-centric, relying on final answer correctness or sequence-level preferences. This suffers from sparse credit assignment, making it difficult to optimize the reasoning process essential for clinical applications. Our analysis reveals that cascading errors from early-stage reasoning failures are a leading cause of incorrect predictions in medical visual question answering (VQA) benchmarks. Motivated by this, we propose Medical Reasoning-aware Policy Optimization (MRPO), an RL algorithm that incorporates step-wise process rewards. When the final answer is incorrect, MRPO assigns exponentially larger penalties to tokens in earlier invalid reasoning steps, breaking failure cascades without compromising successful paths. Across three multimodal LLM backbones, MRPO consistently outperforms standard GRPO and a recent RL baseline, and on Qwen3-VL-8B-Instruct even surpasses substantially larger medical MLLMs such as HuatuoGPT-Vision-34B by 2.79 points. Moreover, MRPO reduces early-stage reasoning failures from 64.0% to 13.0%, showing that targeted mitigation of cascading failures improves both reasoning quality and final answer accuracy. Our code is available at this https URL
[CV-28] Absorption-Feature-Guided Distance-Decoupled Estimation and Band Selection for LWIR Hyperspectral Passive Ranging
链接: https://arxiv.org/abs/2606.31824
作者: Shuo Liu,Chen Fan,Zhihe Chen,Xiaolin Huang,Lilian Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 9 figures
Abstract:Long-wave infrared (LWIR) hyperspectral observations contain distance-dependent atmospheric absorption signatures, providing a physical basis for long-range passive ranging. However, in natural scenes, these signatures are nonlinearly coupled with target temperature, material emissivity, and path radiance, making distance inversion from observed radiance ill posed. Existing methods typically rely on full-band measurements and pixel-wise joint optimization, which is computationally expensive and does not explicitly exploit sharp atmospheric absorption structures. This paper proposes an Absorption-Guided Distance-Decoupled Estimation and Refinement (ADER) framework for LWIR hyperspectral passive ranging. ADER represents emissivity with B-spline control points under a smoothness prior, suppressing overfitting to atmospheric absorption structures and enabling distance-decoupled estimation. It further uses ozone-absorption cues to classify pixels into emission-dominant and reflection-dominant groups. For emission-dominant pixels, ADER compensates path radiance and transmittance and estimates distance by one-dimensional absorption-residual minimization. For reflection-dominant pixels, ADER refines the initial estimate using downwelling-radiance compensation based on the complete radiative model. To reduce spectral redundancy, ADER also introduces a greedy band selection strategy based on multi-scene effective Fisher information for the distance parameter. Experiments on real scenes show that ADER recovers LiDAR-consistent spatial distance structures under both full-band and 20-band settings, improves ranging accuracy in the evaluated regions, and achieves approximately two orders of magnitude speedup over a public full-band hyperspectral ranging method.
[CV-29] Generative Lane Topology Reasoning via Autoregressive Model with Geometry Prior ECCV2026
链接: https://arxiv.org/abs/2606.31814
作者: Jiahui Fu,Zehao Huang,Han Li,Naiyan Wang,Si Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Lane topology reasoning aims to construct a lane graph from onboard sensor observations. Existing methods follow a detection and association paradigm that treats each lane instance independently, leading to geometric inconsistency at connected endpoints and incomplete graphs due to visual occlusions. To address these issues, we propose TopoGPT, a generative framework that learns the geometry prior from typical lane graph structures through autoregressive sequence modeling. Specifically, we construct a large-scale map dataset comprising 3.3M scenes. For each lane graph, a lane tokenizer serializes it into discrete tokens, while a scene context encoder converts it into a rasterized image and extracts global features as scene tokens. We pre-train an autoregressive lane sequence transformer via scene-conditioned next-token prediction, endowing the model with the geometry prior over lane graph structures. Building upon this prior, a perception adapter aligns BEV features from multi-view images with the pre-trained scene condition, transferring the learned geometry prior to sensor-based lane graph prediction. On the OpenLane-V2 benchmark, TopoGPT outperforms existing methods by an average of +6.4 on lane-level and +11.6 on point-level metrics, and produces geometrically consistent and structurally complete lane graphs.
[CV-30] MuSViT: A Foundation Vision Model for Sheet Music Representation ECCV’26
链接: https://arxiv.org/abs/2606.31811
作者: Carlos Penarrubia,Antonio Rios-Vila,Eliseo Fuentes-Martinez,Juan C. Martinez-Sevilla,Francisco J. Castellanos,María Alfaro-Contreras,Jorge Calvo-Zaragoza
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at European Conference on Computer Vision (ECCV’26)
Abstract:Foundation models have transformed vision and language processing by providing rich, reusable representations that transfer across diverse tasks. Sheet music, as a visual encoding of musical language, lacks such a strong domain-specific backbone. We introduce MuSViT (Music Score Vision Transformer): the first foundation vision model for sheet music representation – a ViT encoder pre-trained via Masked Autoencoders on 9.7 million pages from the IMSLP. To handle the complexity of real-world scores, we adopt a two-stage curriculum: a synthetic warm-up on typeset scores followed by large-scale training on the full IMSLP corpus. We evaluate MuSViT on four downstream tasks – full-page and staff-level music score recognition, music symbol detection, and score difficulty classification – under two scenarios: linear probing (frozen encoder) and fine-tuning. Under linear probing, MuSViT consistently outperforms modern vision encoders, revealing that general-purpose representations, regardless of scale, fall systematically short on the structured symbolic properties of musical notation. Under fine-tuning, MuSViT generally improves upon task-specific state-of-the-art methods. An additional embedding-transcription consistency analysis reveals that MuSViT encodes symbolic musical structure directly in its representation space – unlike other encoders, whose embeddings do not correlate with music notation content. These results establish MuSViT as a foundation backbone for sheet music understanding.
[CV-31] Self-Supervised Temporal Regularization for Landmark-Based Cardiac Segmentation with Automatic AHA Regional Mapping MICCAI2026
链接: https://arxiv.org/abs/2606.31785
作者: David Montalvo-García,Nicolás Gaggion,María J. Ledesma-Carbayo,Enzo Ferrante
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2026
Abstract:Graph-based cardiac segmentation with implicit anatomical correspondences provides topological guarantees and population-level analysis capabilities, but models trained on independent frames of image sequences exhibit temporal discontinuities that affect reliable clinical measurements, particularly in cardiac ultrasound. In this work, we introduce self-supervised temporal regularization as a post-training refinement stage that exploits the temporal coherence in image sequences to enforce consistent cardiac segmentation and motion estimation over time, without requiring per-frame annotations. By penalizing velocity and acceleration discontinuities across consecutive frames, our method achieves temporally consistent segmentations while maintaining the learned anatomical correspondences. We further leverage these correspondences to automatically map landmarks to the AHA 17-segment clinical standard, enabling standardized regional assessment and detection of pathological myocardial motion patterns. Validation on CAMUS dataset demonstrates the clinical utility of combining temporal consistency with automatic regional mapping. The code is publicly available at this https URL
[CV-32] Mesh BDF: Barycentric Dominance Field for 3D Native Mesh Generation
链接: https://arxiv.org/abs/2606.31777
作者: Gaochao Song,Haohan Weng,Luo Zhang,Zibo Zhao,Shenghua Gao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures
Abstract:Autoregressive (AR) modeling has recently achieved remarkable progress in native 3D mesh generation, largely due to its natural ability to handle variable-length, discrete data structures. However, the inherent constraints of the AR paradigm severely restrict the generated meshes, leading to limited face counts, bounded vertex resolutions, and difficulties in supporting textures. To overcome these bottlenecks, we propose the Barycentric Dominance Field (BDF), a continuous representation defined on triangular mesh surfaces that elegantly encodes vertex topological connectivity. BDF bridges the fundamental gap between discrete mesh topology and continuous diffusion-based generative modeling by transforming connectivity into a continuous surface signal. As an intrinsic mesh property, BDF shares strong similarities with texture maps, enabling its seamless integration into existing 3D diffusion pipelines without requiring architectural modifications. Extensive experiments demonstrate that BDF empowers diffusion models to generate native meshes with significantly higher quality, greater scalability, and stronger robustness compared to state-of-the-art autoregressive methods.
[CV-33] NURBS Splatting: A Unified Differentiable Rendering Framework for Vector Graphics ECCV2026
链接: https://arxiv.org/abs/2606.31764
作者: Jingye Qiu,Shizhe Zhou
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Differentiable rendering of planar rational splines remains largely underexplored, despite their widespread use in vector graphics and design. Existing differentiable vector renderers primarily focus on Bézier curves and rely on analytic rasterization, which can suffer from gradient instability and limited flexibility. We propose NURBS Splatting, a unified framework that represents planar rational curves as continuous Gaussian fields. By sampling Gaussians along the curve parameter domain and inside closed regions, rendering is reformulated as a smooth accumulation process with stable gradients. Our method naturally supports long splines, rational weights, non-uniform knots, and closed-region filling. We demonstrate its effectiveness in calligraphy reconstruction, vectorization frameworks, and long-spline image abstraction, showing improved stability and reconstruction quality over existing approaches.
[CV-34] Estimating Velocity of Spheres from Rolling-Shutter Image(s)
链接: https://arxiv.org/abs/2606.31760
作者: Wenjie Xue,Jun Yang,Jingmin Wang,Limin Shang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rolling-shutter cameras introduce characteristic distortions when imaging fast moving objects, and these effects are typically treated as artifacts to be corrected. In this work, we instead leverage rolling-shutter distortions as a valuable source of temporal information to estimate the 3D translational and angular velocities of rapidly moving spherical objects from a single rolling-shutter frame. We design a robust and easily detectable spherical pattern and propose a correspondence-free formulation that recovers motion by enforcing geometric consistency in a back-projection framework. By exploiting the geometry of the sphere, translational and rotational motions are decoupled and estimated through a two-stage optimization process, enabling reliable velocity recovery even for textureless objects. Extensive experiments on both synthetic and real datasets demonstrate accurate and robust estimation of motion parameters under challenging high-speed conditions.
[CV-35] JL1-CCQA: Extending the JL1-CD Benchmark with Change Captioning and Question Answering
链接: https://arxiv.org/abs/2606.31745
作者: Ziyuan Liu,Ruifei Zhu,Ouqiao Ma,Yuantao Gu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 8 figures
Abstract:Remote sensing change detection (CD) traditionally focuses on pixel-level binary segmentation, which identifies where changes occur but neither what nor why. To bridge this semantic gap, we introduce JL1-CCQA, a multi-task benchmark that extends the JL1-CD dataset with two complementary annotation layers: change captioning (CC) and change question answering (QA). Built upon 5,000 bi-temporal image pairs acquired by the Jilin-1 satellite at 0.5-0.75m ground sample distance, the benchmark comprises: (i) JL1-CC, providing 17,021 quality-verified captions that describe diverse land-cover transformations; and (ii) JL1-QA, offering 20,060 question-answer pairs across eight question types, enabling fine-grained, interactive interrogation of surface changes. All annotations are produced via a three-stage pipeline consisting of multi-modal large language model (LLM) generation, vision-grounded LLM judging, and human expert verification. We hope that JL1-CCQA, as a benchmark unifying binary change masks, change captions, and change-oriented QA over the same image set, will serve as a valuable resource for the community to advance multi-task change understanding in remote sensing. The dataset is available at this https URL.
[CV-36] Rhythm-Structured Predictive Learning for Remote Photoplethysmography
链接: https://arxiv.org/abs/2606.31736
作者: Ba-Thinh Nguyen,Huu-Dung Nguyen,Thi-Duyen Ngo,Thanh-Ha Le
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote photoplethysmography (rPPG) estimates physiological signals from facial videos by analyzing subtle pulse induced skin color variations. Despite recent progress, existing self-supervised rPPG methods mainly reconstruct masked pixels or low-level visual representations, which can bias the model toward facial appearance rather than latent physiological dy namics. Moreover, most recent Mamba-based approaches scan facial video tokens only in chronological order, limiting their ability to exploit the cyclic structure of pulse signals. To ad dress these limitations, we propose RhythmJEPA, a rhythm structured joint-embedding predictive learning framework for rPPG. Instead of reconstructing RGB frames, RhythmJEPA predicts latent teacher representations from masked facial videos, thereby encouraging physiology-aware representation learning in the embedding space. To explicitly model pulse-related tem poral structure, we introduce a Cyclic Rhythm-State Plan ner (CRSP), which estimates frame-wise latent physiological states and decodes the most plausible cyclic state path via dynamic programming with a constrained transition grammar. Guided by the decoded states, we further design a Dual Order Mamba Encoder (DOM), which combines conventional chronological scanning with state-ordered scanning to capture both local temporal continuity and long-range rhythm-consistent dependencies. Finally, a lightweight Spatial Pulse Mixer (SPM) extracts compact pulse-sensitive facial tokens with a favorable balance between complexity and performance. Experiments on PURE, UBFC-rPPG, and MMPD show competitive performance over representative rPPG methods. The codes are available at this https URL.
[CV-37] MemLearner: Learning to Query Context memory for Video World Models ECCV2026
链接: https://arxiv.org/abs/2606.31734
作者: Jiwen Yu,Jianxiong Gao,Jianhong Bai,Yiran Qin,Kaiyi Huang,Quande Liu,Xintao Wang,Pengfei Wan,Kun Gai,Xihui Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026, Project Page: this https URL
Abstract:Video World Models are interactive video generation models that predict future world states based on user actions and history video frames. A critical challenge in video world models is the lack of memory, causing inconsistent generated scenes over extended durations. Previous methods explored rule-based context frame retrieval as memory, but they fail to generalize in scenarios with scene occlusions and dynamic objects. We propose MemLearner, a learning-based adaptive context query method using query tokens to bridge context and predicted tokens. By leveraging the video generation model itself for context querying, MemLearner exploits pre-trained visual priors without training additional modules from scratch, and incorporates efficient strategies for training and inference. We collect a dataset of long videos with scene occlusions and dynamic objects, paired with camera pose annotations, and propose a multi-dataset training strategy leveraging both annotated rendered and unannotated real-world videos. Extensive experiments demonstrate that MemLearner significantly outperforms prior video world models in terms of scene consistency and memory, particularly under challenging occlusion and dynamic scenarios.
[CV-38] UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization
链接: https://arxiv.org/abs/2606.31732
作者: Yaozhi Zheng,Yilei Jiang,Manyuan Zhang,Yuxuan Wan,Kaituo Feng,Tianshuo Peng,Bo Zhang,Xiangyu Yue
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual-to-Code generation, which transforms scientific plots, vector graphics, and webpages into executable scripts, demands a level of pixel-precise alignment that standard Multimodal Large Language Models (MLLMs) fail to achieve through Supervised Fine-Tuning (SFT) alone. While Reinforcement Learning (RL) offers a theoretical pathway to bridge this gap, its application is hindered by two fundamental obstacles: (1) \textitReward Coarseness, where semantic metrics like CLIP scores fail to penalize fine-grained element deviations, and (2) \textitExploration Stagnation, where the sparse, heterogeneous code search space prevents the policy from bootstrapping valid trajectories. To overcome these limitations, we introduce UniCoder, a unified RL framework that integrates two novel mechanisms. First, we propose \textbfSymbolic Attribute Alignment, which employs a lightweight auxiliary LLM to parse generated code into discrete visual attributes (e.g., hex colors, coordinate limits), enabling dense, element-wise reward computation. Second, to escape local optima, we devise \textbfReference-Guided Code Optimization, a strategy that dynamically injects ground-truth trajectories into low-performing rollout groups, transforming blind exploration into guided policy improvement. Extensive experiments on ChartMimic, UniSVG, Design2Code and ScreenBench benchmarks demonstrate that our 8B-parameter model not only surpasses all open-source baselines but also achieves state-of-the-art performance comparable to proprietary models, establishing a new paradigm for generalized visual-to-code synthesis.
[CV-39] Semantic-Aware Multiple Access via Spatial Redundancy Exploitation for Uplink-Dominant 6G Use Cases
链接: https://arxiv.org/abs/2606.31715
作者: Hamidreza Mazandarani,Masoud Shokrnezhad,Tarik Taleb,Onur Günlü
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Emerging uplink-dominant 6G use cases, such as cooperative vehicular streaming, require efficient transmission of high-volume visual data over limited wireless resources. While semantic communications can reduce traffic by prioritizing task-relevant content, most existing approaches treat users independently and therefore overlook spatial redundancy among nearby devices’ observations. This paper proposes a semantic-aware multiple access scheme that exploits overlapping fields of view among vehicular users to reduce redundant uplink transmissions. We formulate a joint perception and transmission control problem in which users decide which image patches to transmit, when to transmit them, and over which channel, subject to communication constraints. To address the resulting complexity, we introduce a practical two-phase approach. First, nearby vehicles share selected observation patches over Vehicle-to-Vehicle (V2V) links to calculate inter-user spatial redundancy. Second, users transmit only semantically important, non-redundant patches to the base station, where observations can be reconstructed using the received patches and complementary views from neighboring vehicles. Simulation results in a dense urban vehicular scenario demonstrate that our approach improves the proportion of users who achieve high-fidelity reconstruction, highlighting the potential of semantic-aware multiple access for sustainable and resource-efficient 6G uplink systems.
[CV-40] WIDER-FAIR: An Annotated Version of the WIDER-FACE Dataset for Fairness Evaluation
链接: https://arxiv.org/abs/2606.31704
作者: Maxime Moussi,Benoît Ronval,Siegfried Nijssen,Félicien Schiltz
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:The deployment of face detection models in real-world applications raises important fairness concerns, as these systems may showcase performance disparities across demographic groups. A key obstacle to studying and mitigating such biases is the lack of face detection datasets with sensitive feature annotations. To address this gap, we introduce WIDER-FAIR, a new dataset built on the widely used WIDER-FACE benchmark, manually annotated with the perceived ethnicity and sex of each face. The dataset contains 16,256 images annotated across four ethnic groups: Asian, Black, Indian, and White, and two sex categories. We assess the quality and coherence of the annotations using face embeddings, a K-Nearest Neighbors classifier, and a t-SNE visualization, all of which support the consistency of the labeling process. As a demonstration of the dataset’s potential, we train a YOLOv5 model and perform ablation studies on each sensitive feature. Among other findings, our experiments show that detection performance is notably lower for faces of Black individuals, and that excluding this group from training increases fairness disparity more than excluding any other ethnic group. These observations illustrate the value of demographically annotated datasets for understanding and evaluating bias in face detection models.
[CV-41] Phantom: A Unified Face-Swap Deepfake Protection Framework with Latent and Spatial Constraints CVPR2026
链接: https://arxiv.org/abs/2606.31703
作者: Jungkon Kim,Cheolseung Jung,Jong-Min Choi,Juseong Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to CVPR 2026 (Findings)
Abstract:Face-swapping deepfakes pose an escalating threat to personal privacy by enabling unauthorized identity manipulation. While adversarial approaches have demonstrated success against black-box face recognition (FR) models, their applicability to face-swapping scenarios remains underexplored. In particular, reliance on fixed or random targets yields ambiguous latent guidance, and the lack of explicit spatial constraints causes perturbations to spill into identity-irrelevant regions. These issues are further exacerbated by identity-style disentanglement, which suppresses adversarial signals during deepfake generation. In this paper, we present Phantom, a unified face-swap deepfake protection framework that jointly constrains perturbations in latent and spatial domains. Phantom adaptively synthesizes identity-shifted yet attribute-preserving targets to guide identity-aware latent optimization, and applies masked perturbations confined to semantically relevant facial regions. Extensive experiments on state-of-the-art face-swapping deepfakes demonstrate that Phantom improves protection success rates in dodging scenarios by 27.8%, 25.6%, and 16.6% on UniFace, INSwapper, and SimSwap, respectively, while also enhancing visual quality. Furthermore, Phantom generalizes to impersonation scenario, yielding up to 10.2% higher protection while improving perceptual fidelity. These results underscore the effectiveness of jointly leveraging latent and spatial constraints for robust and coherent facial privacy protection.
[CV-42] Look But Dont Touch with Sparse Autoencoders for Unlearning in Diffusion Models
链接: https://arxiv.org/abs/2606.31699
作者: Enrico Cassano,Riccardo Renzulli,Rayyan Ahmed,Marco Grangetto,Stephan Alaniz
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Sparse autoencoders (SAEs) have recently been proposed as interpretable tools for concept-level manipulation, under the assumption that isolated features can serve as controllable intervention points. In this work, we systematically evaluate this assumption in the context of object erasure and steering in diffusion models. We show that while SAEs reliably detect and localize semantic concepts within diffusion model activations, direct intervention in their latent space frequently induces out-of-distribution activations, resulting in severe visual artifacts. To disentangle detection from intervention, we use SAE activations purely as semantic detectors to identify image regions containing the target object, and replace those patch embeddings with the ones that do not contain it. This detection-based replacement preserves the diffusion model’s activation statistics and produces significantly cleaner erasure results than latent steering. Our findings reveal a fundamental gap between concept detection and concept intervention in diffusion models: monosemantic or sparse features are not inherently suitable as control knobs for steering. These results position SAEs as powerful interpretability tools for analyzing generative models, but highlight important limitations when used for direct manipulation, such as unlearning.
[CV-43] Intrinsically Stable Spiking Neural Networks: Overcoming the Performance Barrier in the Absence of Batch Normalization ECCV2026
链接: https://arxiv.org/abs/2606.31695
作者: Ruichen Ma,Xiaoyang Zhang,Jian Bai,Guanchao Qiao,Liwei Meng,Ning Ning,Yang Liu,Shaogang Hu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026 Accepted
Abstract:The performance of deep spiking neural networks (SNNs) often relies on batch normalization (BN). However, the advanced dynamic BN variants used in state-of-the-art models introduce runtime multiplications, which weaken the hardware-efficiency motivation of SNNs. To address this tension, we identify catastrophic firing-rate decay as a primary cause of severe performance degradation in normalization-free SNNs. Guided by this insight, this work proposes the Intrinsically Stable SNN (IS-SNN) architecture, which removes activation-normalization layers by enforcing signal homeostasis through topology-aware weight standardization and modified residual connections. By folding the standardization operations into static weights offline, IS-SNN removes the runtime statistics tracking and multiplications introduced by activation normalization, restoring an accumulation-oriented inference datapath. Comprehensive experiments show that IS-SNN achieves performance competitive with or superior to computationally expensive dynamic BN techniques across VGG, ResNet, and Transformer-based models. Notably, it achieves a competitive accuracy of 68.05% on ImageNet and overcomes the severe depth limitations of prior BN-free attempts. Together with a 96.4% reduction in FPGA lookup table resource consumption for neuron implementations, these results support IS-SNN as a practical framework for building accurate and hardware-friendly deep neuromorphic systems.
[CV-44] Semantic Occupancy Prediction with Dual Range-Voxel Representation
链接: https://arxiv.org/abs/2606.31688
作者: Sitao Chen,Zhuangwei Zhuang,Hui Luo,Lizhao Liu,Qingyao Wu,Mingkui Tan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:LiDAR-based 3D semantic occupancy prediction, which aims to provide accurate and comprehensive scene representation, is crucial for autonomous driving systems. As point clouds suffer from sparsity and incompleteness, leading to insufficient semantic learning and difficult occupancy perception, existing methods often stack multi-sweep point clouds to obtain dense spatial information. However, such a naive strategy also results in efficiency (e.g., additional computational burden) and robustness (e.g., pose transformation noise) concerns, which hinder their practical applications. In this work, we propose a Dual Range-Voxel Representation (DRVR) that leverages the range-view context and voxel-view geometry of single-sweep point clouds for 3D semantic occupancy prediction, eliminating the concerns associated with the multi-sweeps. Specifically, we use the range-view encoder to extract the compact context of the scene. To fully exploit the spatial information, we design a geometry-aware voxel-view encoder that extracts multi-scale voxel-view features separately and combines them for better geometric occupancy prediction. Moreover, we propose a range-voxel fusion module to cooperate range- and voxel-view features via voxel-to-range and range-to-voxel fusions. Extensive experiments on nuScenes-Occupancy, SemanticKITTI and SemanticPOSS show the superiority of our method. Especially on nuScenes-Occupancy, our single-sweep DRVR achieves 5.4% improvement in mIoU and 2.1x acceleration compared to the multi-sweep method.
[CV-45] Histogram-constrained Image Generation ECCV2026
链接: https://arxiv.org/abs/2606.31683
作者: Haoming Liu,Yuanhe Guo,Yijia Cao,Shenji Wan,Hongyi Wen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to ECCV 2026; 31 pages, 16 figures
Abstract:Diffusion models have emerged as a dominant paradigm in generative modeling, enabling high-fidelity sampling from complex data distributions. Despite impressive capabilities, controlling diffusion models to produce outputs aligned with user intent remains an open challenge, especially when balancing global coherence with local precision. Existing control mechanisms vary in the granularity of their conditioning signals. For example, textual prompts guide generation globally through high-level semantics, while ControlNet-like approaches secure precise local structure via dense conditions. In this work, we introduce Histogram-constrained Image Generation (HIG), a novel control mechanism that falls into the middle ground of control granularity. Our framework enforces user-specified distributional constraints (e.g., color histograms or latent token distributions) during the generation process with exact precision. We model such control as an optimal transport (OT) problem and apply explicit guidance transformations during sampling, thereby driving the diffusion trajectory to align with the desired histogram. We demonstrate the versatility of HIG across diverse applications, including constrained generation via color/latent histograms and high-capacity information embedding through histogram-level encoding. Our findings underscore the promise of distributional control, a flexible and interpretable control scheme that is fully compatible with existing control mechanisms, diversifying the hybrid strategies for controllable image generation. Our project page is available at: this https URL.
[CV-46] ShellM aker: Language-Guided Exterior Completion under Structural Constraints ECCV2026
链接: https://arxiv.org/abs/2606.31680
作者: Ruiqi Xu,Daniel Aliaga
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:Despite advances in indoor scene generation, synthesizing coherent building exteriors consistent with generated interiors remains largely unexplored. Existing methods can generate floor plans and wall layouts but typically stop at a structural shell, lacking stylistically consistent facades and roofs. Completing these exteriors is challenging because the footprint, wall geometry, and opening semantics must remain fixed-constraints that unconstrained generative models often violate. We introduce ShellMaker, a language-guided exterior completion framework that operates under these structural constraints. Given a building scaffold and a text style prompt, ShellMaker generates a complete exterior mesh with PBR materials by combining parametric roof generation, LLM-based part-aware prompt refinement, joint wall-roof material retrieval, and geometry-aware assembly. Operating on a format agnostic scaffold representation, ShellMaker generalizes to indoor generators, CityGML, and CAD inputs, while maintaining structural consistency and improving architectural coherence over retrieval and unconstrained generative baselines. The project page is available at this https URL
[CV-47] Practical High-Fidelity Novel-View Synthesis of Mounted Lepidoptera
链接: https://arxiv.org/abs/2606.31679
作者: Kristof Overdulve,Lode Jorissen,Nick Michiels
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mounted butterflies are among the most striking objects in natural history collections. However, their beauty is notoriously hard to digitize in 3D: they are small and fragile, with microscopic hairs and vein structures. Capturing them in sufficient detail, therefore, requires a macro lens, which has a very limited Depth of Field (DoF). Moreover, a camera body cannot be maneuvered beneath a pinned specimen to photograph its ventral surface (the underside of the wings). We introduce an end-to-end pipeline that resolves these challenges to turn such specimens into photo-realistic 3D models viewable from every direction. It combines three ingredients: handheld focus stacking for all-in-focus macro capture without a tripod, a non-contact first-surface mirror system that exposes the ventral surface without touching the specimen, and a segmentation-free, mirror-aware 3D Gaussian Splatting extension. We validate the reconstructions on four diverse specimens.
[CV-48] REDI: Corpus Aware Patch Ranking for DINOv3 Token Reduction
链接: https://arxiv.org/abs/2606.31676
作者: Chanjong Im,Sebastian Diem,Thomas Mandl
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 2 figures, 3 tables
Abstract:Most token reduction methods for Vision Transformers seek favorable tradeoffs between accuracy and efficiency by pruning, merging, or pooling patch tokens. REDI (Relevance for DINOv3 Token Reduction) studies this question through a controlled supervised reference: how should a fixed token budget be allocated across patches for image classification? REDI quantizes final block DINOv3 patch representations into a visual vocabulary and derives class conditioned corpus scores using supervised TF-IDF over visual words. For each validation image, the ground truth class selects a row of the TF-IDF table, and four transformed views produce a TF-IDF map aligned to a reference center crop. A separate dense pass on the same crop provides an attention map. After independent min max normalization, their elementwise product defines the REDI score. A fixed keep, merge, and compress operator then uses score rank to assign patch roles and score magnitude to weight merging and compression. With precomputed REDI scores, a frozen DINOv3 ViT-B/16 backbone, and the same linear classifier used for dense evaluation, the operator reduces the sequence length from 201 to 107 tokens, a 46.8% sequence reduction. The REDI variant based on incoming attention mass achieves 84.706% Top-1 accuracy on ImageNet-1K, compared with 83.514% for the dense baseline, 82.634% for incoming attention mass alone, and 81.796% for supervised TF-IDF alone. The same corpus term also improves reduced classification for three alternative attention formulations relative to their attention only counterparts. Together, these controlled comparisons indicate that class specific corpus statistics and image specific attention provide complementary signals for patch ranking in this setting. Comments: 10 pages, 2 figures, 3 tables Subjects: Computer Vision and Pattern Recognition (cs.CV) MSC classes: 68T45 ACMclasses: I.2.10 Cite as: arXiv:2606.31676 [cs.CV] (or arXiv:2606.31676v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.31676 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-49] WorldRoamBench: An Open-World Benchmark for Long-Horizon Stability of Interactive World Models
链接: https://arxiv.org/abs/2606.31672
作者: Ting-Bing Xu,Jiacheng Sui,Zhe Gao,Kewei Shi,Wenjin Yang,Zhicheng Liu,Zhaoxu Sun,Mingchao Sun,Hongyu Pan,Fan Jiang,Mu Xu,Qi Fan,Yong Li,Baoquan Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Despite rapid progress in interactive world models (IWMs), existing benchmarks evaluate action following only at trajectory level and ignore memory and interaction physics. We introduce WorldRoamBench, an open-world benchmark for long-horizon stability across four dimensions, each with tailored innovations: (i) Action: per-frame action metric bypassing cross-model semantic scale disparity and exposing failures hidden by trajectory; (ii) Vision: segment-based drift metric capturing non-monotonic mid-sequence collapse missed by start-vs-end comparisons; (iii) Physics: controllability-gated evaluation over mechanics, optics, and 3D consistency, scoring plausibility under faithful action execution; (iv) Memory: action-decoupled protocol evaluating scene memory via transition-localized 3D point-cloud reconstruction and subject memory via tracking-plus-VLM reasoning. The benchmark comprises 600+ test cases across Nature, Urban, and Indoor scenes in first/third-person views with WASD 10-60s continuous interaction. Evaluating 10+ open/closed-source models reveals none reliably satisfies all dimensions; even the best achieves only moderate scores. Advances on WorldRoamBench are steps toward IWMs that are stable, physically grounded, memory-faithful, and deployable in real-world applications.
[CV-50] SAMBA: A Scatter-Guided Masked Bidirectional Mamba Foundation Model for SAR Target Recognition
链接: https://arxiv.org/abs/2606.31668
作者: Ke Wang,Xiaoyi Pan,Zhaoyu Gu,Xiaofeng Ai,Zhiming Xu,Feng Zhao,Shunping Xiao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 5figures
Abstract:Synthetic aperture radar automatic target recognition (SAR ATR) is critical for Earth observation and defense, but its practical deployment is constrained by scarce annotated training data. Self-supervised pre-training alleviates this label bottleneck, yet prevailing Transformer architectures incur prohibitive quadratic computational complexity, and conventional universal masking neglects the unique electromagnetic scattering properties intrinsic to SAR imagery. To address these limitations, we propose SAMBA (Scattering-Guided Bidirectional Mamba), an efficient self-supervised pre-training foundation model for SAR target interpretation. Our framework features three core innovations: (i) a linear-complexity Mamba encoder with a mid-sequence class token to mitigate computational bottlenecks; (ii) a three-level hierarchical Scattering-Guided Masked Autoencoder (SG-MAE) masking strategy guided by SAR physical priors, aligning the pretext task with SAR’s intrinsic imaging mechanism; (iii) a lightweight SpatialMix feature interaction module to enhance cross-region feature fusion. We also design a two-stage cross-domain pre-training pipeline to optimize the overall pre-training process. Extensive evaluations demonstrate that SAMBA consistently delivers superior performance across all pre-training configurations, with substantially fewer parameters than both CNN and Transformer baselines. Compared with the default masking strategy in standard MAE, the proposed SG-MAE strategy further boosts the model’s few-shot transfer capability. Benchmarking on seven downstream datasets covering classification and detection tasks shows SAMBA achieves state-of-the-art (SOTA) performance on most metrics, fully validating its robust generalizability across diverse SAR interpretation tasks. Source code and pre-trained weights are publicly available at this https URL.
[CV-51] Sparsity-Inducing Divergence Losses for Biometric Verification ECCV2026
链接: https://arxiv.org/abs/2606.31664
作者: Dimitrios Koutsianos,Ladislav Mošner,Yannis Panagakis,Themos Stafylakis
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ECCV 2026
Abstract:Performance in face and speaker verification is largely driven by margin-penalty softmax losses such as CosFace and ArcFace. Recently introduced \alpha -divergence loss functions offer a compelling alternative, particularly due to their ability to induce sparse solutions (when \alpha1 ). However, standard geometric margins are designed for the softmax function and do not naturally extend to this generalized probabilistic framework. In this paper we propose Q-Margin, a novel \alpha -divergence loss that introduces a principled probabilistic margin. Unlike conventional methods that apply geometric penalties to the logits (unnormalized log-likelihoods), Q-Margin encodes the margin penalty directly into the reference measure (prior probabilities). This formulation naturally encourages discriminative embeddings while preserving the beneficial sparsity properties of the \alpha -divergence. We demonstrate that Q-Margin achieves competitive or superior performance on the challenging IJB-B and IJB-C face verification benchmarks and similarly strong results in speaker verification on VoxCeleb. Crucially, against ArcFace and CosFace baselines trained under an identical recipe, Q-Margin consistently improves at low False Acceptance Rates (FARs), a capability critical for practical high-security applications. Finally, the extreme sparsity of the Q-Margin posteriors enables exact and memory-efficient training, offering a scalable solution for datasets with millions of identities.
[CV-52] DynFly: Dynamic-Aware Continuous Trajectory Generation for UAV Vision-Language Navigation in Urban Environments
链接: https://arxiv.org/abs/2606.31654
作者: Wen Jiang,Hanfang Liang,Li Wang,Kangyao Huang,Wang Xu,Wei Fan,Jinyuan Liu,Shaoyu Liu,Hongwei Duan,Bin Xu,Xiangyang Ji,Huaping Liu
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 34 pages, 9 figures
Abstract:Recent advances in multimodal large models have significantly improved UAV vision-language navigation (UAV-VLN) by enhancing high-level perception and reasoning. However, existing methods mainly focus on predicting discrete actions, local targets, or sparse waypoints, while the continuous transition from navigation intent to executable UAV motion remains weakly modeled. This motion-interface gap limits the continuity, stability, and executability of generated UAV trajectories. To address this gap, we propose DynFly, a dynamic-aware continuous trajectory generation framework that bridges high-level navigation reasoning and executable UAV motion. DynFly bridges high-level navigation intent and continuous UAV motion through a lightweight trajectory generation layer. Specifically, it represents expert trajectories in B-spline control-point space and employs a Spline-DiT generator to learn conditional trajectory generation via flow matching. Furthermore, we introduce UAV-oriented dynamic-aware supervision over position, finite-difference velocity, finite-difference acceleration, heading consistency, and local target alignment, enabling the generated trajectories to better satisfy UAV motion characteristics. And our trajectory generation framework can also be integrated with an existing UAV-VLN framework while preserving its original visual-language reasoning pipeline. Extensive experiments on the OpenUAV UAV-VLN benchmark show that DynFly improves both navigation performance and trajectory quality. On the Test Unseen Full split, DynFly improves the strongest baseline by 4.69 NDTW, 2.40 SDTW, 2.14 SR points and 4.87 OSR points, while reducing NE by 4.51 m.
[CV-53] chnical Report of RoboSpatial Challenge at CVPR 2026: Selective Reasoning Activation and Reference-Frame Disambiguation for Embodied Spatial Reasoning
链接: https://arxiv.org/abs/2606.31645
作者: Yuxiang Xie,Qi Lv,Jianming Xing,Zijian Hong,Xiang Deng,Weili Guan,Liqiang Nie
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models achieve strong general perception but often struggle with the spatial reasoning required for embodied tasks. We present RoboSpatialBrain, our submission to the RoboSpatial Challenge at the Embodied Reasoning in Action Workshop, CVPR 2026, built on RoboBrain2.5-8B-NV. RoboSpatialBrain combines two training-free, inference-time mechanisms: a forced think prefix activation strategy paired with a task-specific post-prompt that elicits deliberate reasoning on context and compatibility tasks, and an explicit reference-frame redirection pipeline that resolves camera-centric and object-centric ambiguity for context tasks. We additionally explore fine-tuning RoboBrain2.5 on compatibility data and present a detailed analysis of its interaction with prompting. RoboSpatialBrain achieved first place in the RoboSpatial Challenge, with an overall success rate of 80.9% on RoboSpatial-Home. Code is available at this https URL.
[CV-54] LiteMatch: Lightweight Zero-Shot Stereo Matching via Cost Volume Stabilization
链接: https://arxiv.org/abs/2606.31636
作者: Md Raqib Khan,Santosh Kumar Vipparthi,Subrahmanyam Murala
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite rapid progress in learning-based stereo matching, high accuracy is often achieved at the cost of heavy backbones and computationally intensive 3D cost volume processing, resulting in substantial memory and runtime overhead. More critically, these methods frequently struggle to generalize across domains, limiting their practical deployment. We present \textitLiteMatch, a lightweight stereo matching framework that achieves strong zero-shot generalization through cost volume stabilization-without expensive 3D convolutions. LiteMatch employs two complementary encoders: a Cross-View Correspondence Encoder (CVCE) to capture global cross-view interactions, and a High-Frequency Encoder (HFE) that enhances fine structural details via FFT-based frequency cues. To stabilize the cost volume, we introduce the \textitCost Volume Consistency Loss (CVC-Loss), a voxel-wise binary cross-entropy objective applied to softmax-normalized cost distributions. By encouraging sharp and unimodal disparity probabilities, CVC-Loss promotes stable cost distributions and enables rapid convergence. A lightweight refinement module further produces sharp full-resolution disparities with low-iteration updates, avoiding heavy recurrent refinement. With a flexible design ranging from 3.36M to 9.58M parameters, LiteMatch achieves exceptional zero-shot generalization, delivering competitive EPE and D1 performance across Scene Flow, KITTI, Middlebury, ETH3D, and DrivingStereo. Our results establish that lightweight architectures can indeed generalize across domains without sacrificing accuracy. \hrefthis https URL\textcolorblueCode
[CV-55] PrISM-IQA: Image Quality Assessment Made Practical for Smartphone Photography
链接: https://arxiv.org/abs/2606.31626
作者: Shuyan Zhai,Jiaqi He,Weixia Zhang,Liang Wang,Zhenjie Lee,Zufeng Zhang,Kede Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Existing smartphone image quality assessment (IQA) methods commonly reduce perceptual quality to a single score. However, this scalar formulation is poorly aligned with practical image signal processor (ISP) tuning, where engineers must identify specific quality issues, estimate their severities, and determine whether they are acceptable or require intervention. In this work, we introduce a Practical ISP-aware Structured Model for IQA (PrISM-IQA), which reformulates smartphone IQA as a multi-issue ordinal diagnosis problem. Rather than regressing a single quality score, PrISM-IQA predicts an \textitordered severity level – absent, minor, severe, or critical – for each ISP-relevant issue, covering both global image-level artifacts and local content-dependent defects. To produce logically consistent predictions, PrISM-IQA combines cumulative ordinal encoding with structured inference that captures within-issue monotonicity as well as cross-issue subsumption and exclusion relations. We evaluate PrISM-IQA on a reconstructed SPAQ benchmark annotated with 53 ISP-relevant quality issues and on a small-scale expert-annotated real-world dataset. Experimental results demonstrate the effectiveness of PrISM-IQA for practical issue-level diagnosis, reveal transferable perceptual quality representations through linear probing, and further show how its predictions can support actionable and meaningful ISP tuning.
[CV-56] Robust Autonomous UAV Landing on Maritime Platforms via Multimodal Agent ic AI and Active Wave Compensation
链接: https://arxiv.org/abs/2606.31613
作者: Francisco S. Neves,Pedro N. Pereira,Raul D.S.G. Campilho,Andry M. Pinto
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Autonomous aerial inspection of marine infrastructure is frequently compromised by stochastic sea states, introducing risks of high-kinetic impacts, post-landing toppling, and sensory occlusion. This paper proposes a decoupled, multi-vehicle landing framework synchronizing an Unmanned Surface Vehicle (USV) equipped with a 3-RPU stabilized platform with a robust Unmanned Aerial Vehicle (UAV). The architecture utilizes two independent Deep Reinforcement Learning (DRL) agents: a Soft Actor-Critic (SAC) agent providing high-frequency wave-motion compensation for the landing deck, and a multimodal RL agent for the UAVs final approach. Evaluated in high-fidelity maritime simulations, the system achieved a 100% landing success rate across 15 trials in wave states varying from calm to rough. Results show a mean stabilization efficacy of 87.8%, maintaining the landing surface within 1 degree of the horizontal plane for 96% of the mission duration in rough conditions, effectively contributing to safer landings.
[CV-57] What Memory Do GUI Agents Really Need? From Passive Records to Active Task-Driving States
链接: https://arxiv.org/abs/2606.31612
作者: Chen Liu,Ling Chen,Hanzhang Zhou,Xu Zhang,Quyu Kong,Panrong Tong,Wenhao Wang,Xin Yu,Steven Hoi,Yue Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Mobile GUI agents increasingly face long-horizon tasks that require reading, updating, and reusing task-relevant data across pages and applications. Existing memory methods treat memory largely as passive storage, where past observations are accumulated and retrieved when needed. Yet retrieving a value does not reveal its current role in the workflow. The agent must still infer from accumulated records whether the value should be used now, has already been used, or must wait for a later dependency. This implicit reconstruction becomes unreliable in long trajectories with similar fields, repeated values, distractors, and outdated states, causing repeated or missed operations. We propose Active Task Driving Memory (ATMem), which shifts GUI-agent memory from passive storage to an actively maintained execution state. ATMem maintains task-relevant information as a continually updated execution state that links each value to its role and current status, enabling action selection based on the current workflow state. We therefore introduce \textbfSTR-GRPO, an online reinforcement learning method that learns to use ATMem selectively according to its contribution to task completion. STR-GRPO contrasts memory-on and memory-off rollouts to estimate when memory use improves execution, while memory-cost-aware reward discourages costly memory usage that does not improve execution. To evaluate whether agents can complete all in-scope work while avoiding out-of-scope actions over long-horizon execution, we build a challenging mobile benchmark. From a list of near identical entries, agents must act on every entry that satisfies the instruction and reject entries that violate its constraints.
[CV-58] Learning Structurally Consistent Representations for Multi-View Radar Semantic Segmentation
链接: https://arxiv.org/abs/2606.31609
作者: Ali Zia,Muhammad Umer Ramzan,Abdelwahed Khamis,Usman Ali,Abdul Rehman
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Radar sensors provide reliable perception under adverse weather and lighting conditions, but their sparse, noisy, and weakly semantic measurements make dense semantic segmentation challenging. Most existing radar segmentation methods rely on grid-based encodings and pairwise interactions, which struggle to capture the higher-order relational structure formed by multiple radar returns from the same physical object. We introduce a unified higher-order structural alignment framework for multi-view radar segmentation. The proposed method refines radar feature representations using learnable hypergraphs to capture higher-order dependencies among spatially related responses. To ensure consistency across heterogeneous radar projections, we further align view-specific features using Unbalanced Optimal Transport (UOT), enabling correspondence-free alignment under varying measurement densities and partial observations. An adaptive attention mechanism then fuses complementary radar views while emphasising structurally informative responses under sparsity and noise. The resulting architecture learns structurally consistent representations across Range Angle (RA), Range Doppler (RD), and Angle Doppler (AD) views and is trained using supervised segmentation together with cross-view consistency regularisation. Experiments on the CARRADA and RADIal benchmarks demonstrate consistent improvements over strong radar-specific baselines, achieving 63.8% mIoU on CARRADA and 83.4% mIoU on RADIal, improving the previous best methods by +1.7 and +2.3 mIoU, respectively. These results highlight the importance of higher-order relational modelling for robust radar perception.
[CV-59] Preserve the Hard Regenerate the Rest: Uncertainty-Guided Synthetic Training Data Augmentation with Diffusion Models
链接: https://arxiv.org/abs/2606.31603
作者: Nikolai Röhrich,Julian Gleißner,Ahmed H. A. Ibrahim,Silvan Mertes,Tobias Huber
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 13 pages, 7 figures
Abstract:Semantic segmentation models struggle with data sparsity and rare or visually diverse regions, e.g., dense regions or small objects in aerial or autonomous mobility data. While synthetic augmentation is an appealing solution, directly generating new labeled data risks misalignment of labels and generated pixels. Existing solutions to this problem often rely on external models, or employ coarse heuristics such as indiscriminately augmenting all foreground objects or entire backgrounds, which wastes capacity on uninformative pixels. To address this, we propose an uncertainty-guided synthetic context augmentation strategy that strictly preserves label validity and efficiently maximizes pixel informativeness per synthetic sample - no external guardrails required. Using a baseline segmenter’s predictive entropy, we identify uncertain semantic regions and inpaint only the complementary visual context. When fine-tuning the segmenter on this synthetic data, we compute the loss only over the original pixels, excluding inpainted regions. This focuses learning on the unmodified, uncertain regions while presenting them in novel contexts. We demonstrate substantial mIoU gains on Cityscapes, UAVID, and BDD100K with the largest gains on rare and difficult classes such as buses, trains, or (from the aerial perspective) cars. Our results demonstrate that uncertainty-guided context augmentation is a highly effective lever to improve segmentation performance on complex datasets, with code provided at this https URL.
[CV-60] oken-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning ICML2026
链接: https://arxiv.org/abs/2606.31599
作者: Kaitao Chen,Weiqian Zhao,Jiamin Wu,Qihao Zheng,Shangquan Sun,Chunfeng Song,Xiaosong Wang,Mu Zhou,Mianxin Liu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: ICML2026
Abstract:Vision-language models (VLMs) combining reinforcement learning (RL) ignite remarkable progress in multimodal reasoning, yet still struggle with medical images, which typically exhibit extremely sparse visual evidence to inform clinical decision-making. We recognize that pruning visual tokens outside the grounding region greatly enhances medical reasoning. However, a united RL framework for active visual token pruning (VTP) and medical multimodal reasoning remains unestablished. Here, we propose a dual-stream RL framework, ViToS, to fulfill token pruning and question answering. ViToS trains one policy model with two task branches, where one focuses on grounding while the other conducts token-sparse reasoning after VTP. Furthermore, we solve the coupled policy learning problem by introducing the cross-feedback sequential optimization, avoiding gradient conflict and facilitating convergence of the shared policy model. Evaluated on seven medical benchmarks, our method reduces visual tokens to 77% of the original sequence length while achieving a 108.27% relative performance on Lingshu-7B and 104.16% relative performance on HuatuoGPT-Vision-7B. Overall, ViToS delivers superior performance and inference speedup, establishing an efficient paradigm for medical multimodal reasoning.
[CV-61] DPPE: Rethinking Camera-Based Positional Encoding for Scaling Multi-View Transformers
链接: https://arxiv.org/abs/2606.31585
作者: Shun Kenney,Teppei Suzuki
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The remarkable scalability of Transformers has expanded their application to 3D computer vision, where camera-aware positional encoding is crucial for providing spatial cues in multi-view geometry. Recent advancements have established the practice of using camera parameters – such as extrinsics or projection matrices – as relative positional encoding into the query, key, and value vectors of the attention mechanism. However, when scaling up the training recipe of novel view synthesis (NVS) models with the camera-based positional encoding, we observe a significant issue: model performance stagnates in the late stages of training. In this paper, we investigate the cause of the performance bottleneck when scaling up and demonstrate that storing rotation and translation given by the positional encoding in the same dimensions of the value vector causes indeterminacy in their independent identification, hindering training scalability. To address this, we propose Decoupled Pose Positional Encoding (DPPE), a novel camera-based positional encoding that explicitly decouples rotation and translation. Extensive evaluations on NVS tasks demonstrate that DPPE enables stable long-term training even in scaled-up training setup. Furthermore, it exhibits superior generalization performance in extrapolation settings, such as handling an increased number of viewpoints and zoom-in scenarios. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.31585 [cs.CV] (or arXiv:2606.31585v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.31585 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-62] Localized Conformal Prediction for Image Classification with Vision-Language Models
链接: https://arxiv.org/abs/2606.31577
作者: Clément Fuchs,Tim Bary,Benoît Macq
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 2 figures, 3 tables, code availables, accepted to EUVIP 2025
Abstract:Conformal predictions have attracted significant attention in the field of uncertainty quantification, mainly because of their strong marginal coverage guarantees. Full conditional guarantee is not an attainable goal, a well known fact in conformal predictions literature. As a result, several approaches have tried to approximate this behavior by adapting the conformal sets of test-time samples according to their similarity to calibration examples. Although the latter has gained traction and shown impressive performances for regression problems, its application to image classification remains under-explored. We conduct an extensive benchmarking on natural image classification tasks with vision-language models (VLMs), using our open source implementation of a recent localized conformal prediction algorithm. We show that straightforward usage of the cosine similarity between test-time and calibration visual features, an intuitive choice for VLMs, is not sufficient to improve over the non-local baselines. In response, we propose a simple non-linear transformation of the cosine similarities, which conserves marginal coverage guarantees and achieves statistically significant mean set sizes reduction. Code is available at this https URL.
[CV-63] mperature Field Reconstruction of Tungsten Monoblock Divertor on EAST using Physics-aware Neural Operator Transformer
链接: https://arxiv.org/abs/2606.31574
作者: Zikang Yan,Xiao Wang,Qingquan Yang,Zhendong Yang,Gaoting Chen,Zehua Chen,Bo Jiang,Jin Tang,Guosheng Xu
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Accurate modeling of the divertor temperature field is essential for preventing material melting and damage and for extending the service life of fusion devices. However, conventional numerical methods, such as the Finite Element Method (FEM), are computationally expensive and therefore unsuitable for real-time applications. Therefore, a fast and generalizable method is required for real-time reconstruction of the divertor temperature field and subsequent real-time control. To address the above issue, we propose a Physics-aware Neural Operator Transformer (PNOT) to characterize the spatiotemporal evolution of the divertor temperature field. It models boundary heat-flux relations as a structured graph and employs graph attention to explicitly capture spatial physical dependencies. Inspired by physics-aware attention, we further develop a physics-aware neural operator module to aggregate query points with similar physical conditions via slicing and model heat diffusion, while a gradient-constrained Sobolev regularization loss enforces consistency between function values and their derivatives. Experimental results show that these physical constraints improve prediction accuracy while preserving physical consistency. The source code of this paper will be released on this https URL
[CV-64] Mitigating Positional Leakage in 3D Masked Autoencoders for Robust Representation Learning
链接: https://arxiv.org/abs/2606.31570
作者: Xu Yan,Huiqun Wang,Chen Wang,Lei Ren,Di Huang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Masked autoencoding has emerged as a prominent paradigm for self-supervised learning on 3D point clouds, achieving competitive performance across downstream tasks. Unlike its 2D counterpart, 3D masked autoencoding directly reconstructs spatial coordinates, making it inherently susceptible to positional leakage. In this work, we identify that the decoder in existing 3D MAE frameworks tends to over-rely on positional information, which weakens semantic representation learning and leads to suboptimal feature quality. To address this issue, we propose MPL-MAE, a masked point learning framework that mitigates positional over-reliance while enhancing the utilization of encoder features. Specifically, we introduce a recalibrated positional embedding module that suppresses metric-dominant coordinate signals while preserving geometric topology, together with a gated positional interface module that dynamically regulates positional injection during reconstruction. These designs promote a more balanced interaction between spatial priors and semantic features, yielding robust and informative representations. Extensive experiments across downstream tasks demonstrate that MPL-MAE consistently achieves competitive performance, validating its effectiveness. Code is available at this https URL.
[CV-65] AugSplat: Radiance Field-Informed Gaussian Splatting for Sparse-View Settings
链接: https://arxiv.org/abs/2606.31556
作者: Lorenzo Lazzaroni,Riccardo Bollati,Daniel Barath,Michael Niemeyer,Keisuke Tateno
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 5 figures
Abstract:Generating high-quality novel views at real-time frame rates remains a central challenge in 3D vision, particularly in sparse-view scenarios. Neural radiance fields have demonstrated robust reconstruction from limited observations, but their reliance on volumetric rendering leads to high computational cost and slow inference. In contrast, Gaussian Splatting methods achieve real-time rendering through rasterization, but their optimization is highly sensitive to the quality of the initial geometry. This sensitivity becomes especially problematic in sparse-view settings, where limited observations often lead to incomplete or noisy point-cloud reconstructions. In this work, we present AugSplat, a simple framework for improving Gaussian Splatting in sparse-view regimes using radiance-field-based view augmentation. We first train a radiance field on the sparse input views and use it to synthesize additional images from nearby novel viewpoints, increasing the effective view-space coverage available for supervision. These synthetic views are then used as auxiliary supervision during Gaussian Splatting optimization. We study two variants: Staged AugSplat, which uses synthetic views for an initial optimization phase before switching to real images, and Dual AugSplat, which jointly trains on real and synthetic views with a decaying synthetic loss weight. Experiments on sparse-view mip-NeRF 360 scenes show that AugSplat improves reconstruction quality over standard Gaussian Splatting. Staged AugSplat achieves the strongest average performance, while Dual AugSplat provides a closely performing formulation that keeps real-image supervision active throughout training, and both variants preserve real-time rendering at inference.
[CV-66] MV-GEL: Language-Driven Multi-View Geometric Entity Localization on Meshes
链接: https://arxiv.org/abs/2606.31533
作者: Kartik Bali,Roland Aydin
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Identifying and grounding precise geometric entities, such as edges, planar regions, and curved surfaces within 3D objects, is foundational to computer-aided design (CAD), robotic manipulation, and scientific simulation. Although modern Vision Language Models (VLMs) have advanced referring segmentation (RIS) in the image domain, extending such language-driven localization to structured 3D geometry is substantially harder. The 3D object appearance is highly sensitive to viewpoints; a single perspective may render a target entity clearly observable, while another may suffer from severe occlusion or foreshortening. In this work, we attempt to solve these challenges with MV-GEL (Multi-View Geometric Entity Localization), a framework for localizing fine-grained geometric entities on polygon meshes from natural language queries. Our key insight is that reliable CAD entity (i.e., faces, edges or solids) localization depends on selecting views that make the queried entity maximally interpretable. We introduce GELviews, a prompt-conditioned ranking module that prioritizes viewpoints based on language prompted observability of geometric CAD entities. Selected views are processed by a VLM-based reasoning segmentation backbone, and predicted masks are lifted to the corresponding meshes via geometry-aware ray casting. Our framework is completely CAD agnostic and relies only on 3D meshes. Experiments show up to a 1.7X improvement in face-level IoU and over 4.5X gains in edge-level F1 compared to vanilla baselines, substantially outperforming CLIP-based and random view sampling, particularly for thin and view-sensitive this http URL dataset, code and trained checkpoints are available at this https URL.
[CV-67] PRISM: Latent Composition Consistency for Single-Image Reflection Removal
链接: https://arxiv.org/abs/2606.31513
作者: Junseong Shin,Tae Hyun Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single-image reflection removal (SIRR) seeks to recover the transmission layer from a mixture corrupted by reflections – a severely ill-posed problem. Existing methods operate in pixel space, where the nonlinear sRGB formation model entangles the two layers and limits generalization. We observe that pretrained VAE latent spaces exhibit substantially lower coherence between image layers compared to pixel space, providing a more favorable working space for decomposition. Building on this finding, we propose \textbfPRISM (Pretrained-latent Reflection Image Separation Model), which reinterprets SIRR as a latent linear separation problem. Under an approximate additive formulation in latent space, PRISM learns a flow matching velocity field on a pretrained FLUX backbone that recovers both transmission and reflection in a single forward pass. To enforce robust disentanglement, we introduce a Latent Composition Consistency (LCC) strategy that constructs synthetic mixtures by swapping reflection latents across samples and enforces consistent decomposition via a cycle loss. We further propose a Layer Contrastive Separation (LCS) loss that promotes semantic separation between layers through patch-level contrastive learning, without requiring explicit reflection targets. Experiments on six benchmarks demonstrate that PRISM consistently outperforms state-of-the-art methods by significant margins, with strong generalization to in-the-wild images.
[CV-68] SimpleSearch-VL: A Simple Recipe for Multimodal Agent ic Deep Search
链接: https://arxiv.org/abs/2606.31504
作者: Ming Dai,Zhihong Lu,Jinjie Gu,Jiedong Zhuang,Yefeng Liu,Wankou Yang,Jian Wang,Chunhua Shen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report
Abstract:We present SimpleSearch-VL, an efficient, reliable, and practical framework for multimodal agentic search. Its core idea is to improve the agent’s own search-and-verification process rather than scaling data, tools, or auxiliary model components. For efficiency, Factorized Adaptive Rollout (FAR) improves sampling efficiency by forming more informative training groups while using redundant samples to mitigate long-tail latency and expose hard samples. For reliability, SimpleSearch-VL performs evidence-verified reasoning, explicitly using chain-of-thought verification to assess the relevance of retrieved visual and textual cues to the original context. For practicality, SimpleSearch-VL keeps a lightweight tool interface and performs webpage self-summary within the agent, requiring no additional external dependencies. With only 5K supervised tool-interleaved trajectories and 2K RL data, SimpleSearch-VL improves Qwen3-VL agentic baselines by 15.8 and 16.0 average points for the 8B and 30B-A3B variants, respectively. The SimpleSearch-VL-30B-A3B model further achieves performance competitive with agentic Gemini-3-Pro.
[CV-69] Fully Automated High-Precision Segmentation of Retinal Atrophy and Ellipsoid Zone Thickness in OCT: A Reliable Tool for Real-World GA Monitoring
链接: https://arxiv.org/abs/2606.31502
作者: Wolf-Dieter Vogl,Hlynur Skulason,Oliver Leingang,Ursula Schmidt-Erfurth,Amir Sadeghipour,Ariadne Whitby
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 31 pages, 6 tables, 7 figures, contain 3 supplemental figures and 2 supplemental tables
Abstract:Geographic atrophy (GA) secondary to age-related macular degeneration (AMD) requires precise monitoring of relevant structural biomarkers to assess disease stage, progression, and treatment response. This paper presents a fully automated, deep learning-based framework for the high-precision, pixel-wise segmentation of key biomarkers in optical coherence tomography (OCT) imaging: retinal pigment epithelium (RPE) loss, ellipsoid zone (EZ) loss, and EZ thinning. The proposed pipeline uses three specialized semantic segmentation models to delineate RPE loss, EZ boundaries (including interruptions), and Bruch’s membrane. To ensure robustness and generalizability, the models were developed on a diverse dataset of 298 SD-OCT volumes representing the full phenotypic spectrum of AMD (GA:222, intermediate AMD: 40, neovascular AMD: 17, healthy: 19) and validated on an independent external dataset (n=43). The comprehensive evaluation was further strengthened using additional datasets to assess repeatability, inter-reader reliability, the impact of B-scan density on measurement accuracy, and subgroup performance stratified by lesion size. Results demonstrated high segmentation accuracy (Dice RPE loss: 0.88, Dice EZ loss: 0.87, Pearson’s r 0.99). Total EZ thickness measurements exhibited a sub-pixel average deviation of 2.15 \mu m , and segmentation reliability was confirmed by a strong reproducibility score (ICC 0.98). By accurately and consistently quantifying outer photoreceptor degeneration and RPE loss, this fully automated framework provides a highly reliable tool for GA assessment in both clinical trials and routine real-world ophthalmic care.
[CV-70] HVPNet: A Bio-Inspired Network for General Salient and Camouflaged Object Detection
链接: https://arxiv.org/abs/2606.31496
作者: Jiawei Xu,Qiangqiang Zhou,Zhouping Li,Yanjiao Shi,Yugen Yi,Jiacong Yu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In recent years, most research on multimodal salient object detection (SOD) and camouflaged object detection (COD) typically aims to improve performance through complex cross-modal feature fusion and decoding structures. However, this approach leads to an excessively large model parameter scale and often fails to deliver satisfactory detection performance due to structural redundancy. In contrast, the human visual process is able to efficiently perform salient and camouflaged object identification without such complex structures. This contrast raises an important question: Can we draw conceptual inspiration from the human visual process to achieve a simpler modeling strategy, and still realize accurate and efficient object detection? To answer this question, we propose HVPNet, a simple yet general bio-inspired computational architecture. Drawing on the multi-layered information integration of the retina as a conceptual metaphor, we designed a Retinal Integration Module (RIM), which effectively integrates multimodal features through a level-specific multi-stage integration strategy. To fully exploit these features, we further design a cortical decoder (CD) that breaks down the decoding process into low- and high-level visual stages, abstracting the hierarchical processing in the human visual cortex. Benefiting from these designs, HVPNet can readily extend to seven tasks across four modalities. Without bells and whistles, it establishes an excellent accuracy-efficiency trade-off across 22 datasets spanning these seven tasks. Our code is available at this https URL.
[CV-71] DrivingDepth: Sparse-Prompted Pixel-wise Scale Correction for Driving Depth Estimation
链接: https://arxiv.org/abs/2606.31488
作者: Chi Huang,Wenhao Zhang,Hang Yin,YuAn Wang,Hao Li,Bosheng Wang,Xun Sun,Liang Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dense depth estimation for autonomous driving faces a geometry-scale conflict: depth foundation models deliver pixel-aligned dense visual geometry without reliable metric scale, while projected LiDAR provides metric anchors that are sparse, noisy, and misaligned with image structures. Existing sparse-prompted methods incorporate LiDAR by regenerating depth from scratch, overriding the foundation model’s coherent geometry and producing structural artifacts on visually continuous surfaces. Our key insight is that foundation models already capture geometrically coherent relative depth; no additional surface structure learning is required-only a per-pixel scale factor mapping relative geometry to metric coordinates. Based on this, we propose DrivingDepth, which treats sparse LiDAR as geometric prompts that locally calibrate a frozen foundation prior through residual pixel-wise scale correction, preserving dense visual geometry by construction. On nuScenes with 4-frame surround-view input, DrivingDepth achieves an AbsRel of 11.19 and an EdgeCR of 5.741, outperforming MapAnything (11.99/1.914) by simultaneously delivering SOTA metric accuracy and geometric consistency.
[CV-72] One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution
链接: https://arxiv.org/abs/2606.31478
作者: Jie Ma,Binfei Chu,Jie Gao,Jinlu Zhang,Yiwei Ma,Yi Tan,Jiayi Ji,Xiaoshuai Sun,Rongrong Ji
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous research agents can now draft hypotheses, write code, run experiments, and produce papers, but they remain brittle when experiments fail. Under the prevailing paradigm, failure recovery is usually delegated to a single free-form reflection: a rich trajectory of metrics, logs, and design choices is compressed into one verbal critique, which often leads either to localized trial-and-error or to hard pivots that discard useful context. We propose SAGE, a Self-correcting, Autonomous, Grounded Experimenter, to tackle this failure-recovery bottleneck. Its core mechanism, Multi-Hypothesis Failure Attribution (MHFA), treats recovery as a structured causal diagnosis. By analyzing dynamic trajectory features, MHFA systematically generates multiple evidence-grounded explanations for a failure, independently evaluates their severity, and deterministically routes the verified root cause to the correct intervention level (hypothesis, experimental design, or implementation). To guarantee scientific honesty, SAGE further employs a grounded reporting mechanism that explicitly constrains drafted results to actual measured values, redacting hallucinated numbers. On a 12-topic, 5-domain benchmark, SAGE increases metrics-bearing outputs from 42% to 92% over a reflection baseline, improves artifact quality from 5.00 to 6.75/10, and blindly outscores AI-Scientist-v2 (52.0 vs. 48.2), with gains concentrated in code development and execution. While fully autonomous scientific writing and generating conference-ready papers remain notoriously difficult open problems for the entire field, SAGE successfully produces significantly more reliable and higher-quality scientific artifacts. Ultimately, by coupling structured recovery with explicit grounding constraints, SAGE significantly outperforms monolithic reflection paradigms, establishing a highly trustworthy foundation for future autonomous research.
[CV-73] hink While You Map: Asynchronous Vision-Language Agents for Incremental 3D Scene Graphs ECCV2026
链接: https://arxiv.org/abs/2606.31471
作者: Deniz Bickici,Michael Pabst,Shohei Mori,Dieter Schmalstieg
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. Project page: this https URL
Abstract:Open-vocabulary 3D scene graph methods typically operate in two stages: first reconstruct, then enrich with vision-language models, leaving the graph unqueryable during exploration. We argue that this sequential coupling is unnecessary and propose an asynchronous architecture in which lightweight online mapping runs concurrently with heavyweight semantic refinement. A probabilistic voxel-based backbone maintains stable object identities incrementally, while background VLM agents progressively enrich the graph. This framework resolves duplicate object tracks through semantic loop closure, attaches fine-grained visual attributes and derives spatial relations between objects. A multi-target frame scheduler amortizes VLM cost by selecting a small set of informative frames that jointly cover multiple targets. The resulting scene graph is queryable during exploration and grows in semantic richness over time. Our method matches or outperforms existing open-vocabulary 3D scene graph methods on semantic segmentation (ScanNet, Replica) and surpasses the prior state-of-the-art across three visual grounding benchmarks (Sr3D+, Nr3D, ScanRefer) by 15.3 to 18.8 A@0.25. Project page: this https URL
[CV-74] AeroVerse-SatAgent : UAV-Satellite Collaborative Spatial Reasoning Inspired by the Dual Visual Pathway Theory of Cognitive Neuroscience
链接: https://arxiv.org/abs/2606.31467
作者: Wenyi Zhang,Fanglong Yao,Youzhi Liu,Peng Hu,Zhengqiu Zhu,Chen Gao,Xian Sun,Kun Fu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 21 pages, 10 figures and 8 tables
Abstract:With the rapid advancement of aerospace embodied intelligence, enabling Unmanned Aerial Vehicles (UAVs) to autonomously understand and reason about complex environments has become increasingly important. However, existing UAV-based spatial reasoning approaches face critical limitations: single-view perception renders them vulnerable to occlusions and perspective distortions, while most VLMs lack explicit geometric modeling, relying on semantic cues and yielding inconsistent reasoning under viewpoint and scale variations. To address these challenges, we propose SatAgent, a UAV-Satellite collaborative spatial reasoning model inspired by the dual-pathway mechanism of the human visual system. By jointly leveraging satellite and UAV perspectives, SatAgent enables robust, accurate reasoning in complex urban environments. We first introduce a Geometric-Aware 3D Reconstruction Encoder that elevates 2D UAV features into explicit 3D spatial representations. Next, we design a multi-view topology-semantic alignment module integrating cross-view features within a unified BEV coordinate system. We further introduce a multi-view consistency loss encouraging viewpoint-invariant representations. Finally, we construct SatAgent-SR130K, the first large-scale UAV-Satellite collaborative multi-view spatial reasoning dataset. Experiments show SatAgent outperforms state-of-the-art general-purpose foundation models and specialized spatial reasoning models by 25.91% and 11.69%, respectively, across diverse tasks, achieving particularly high accuracy in complex geometric relationship reasoning.
[CV-75] owards a foundational model for recognising diastematic Gregorian notation
链接: https://arxiv.org/abs/2606.31454
作者: Daniel Kurek,Jan Hajič jr
类目: Computer Vision and Pattern Recognition (cs.CV); Digital Libraries (cs.DL)
备注:
Abstract:Optical recognition of Gregorian notation has recently been attempted with end-to-end methods, with four datasets introduced. However, each of these datasets is in a different encoding. We design a common encoding based on the S-GABC proposal, convert all four datasets to this common encoding, and train a shared end-to-end foundational model for diastematic Gregorian notation that establishes a new state of the art across all four datasets.
[CV-76] mporal Training Strategies for Left Atrium and Left Atrial Appendage Segmentation in Dynamic Contrast 4DCT
链接: https://arxiv.org/abs/2606.31444
作者: David Montalvo-García,Lauren Severance,Elliot R. McVeigh,María J. Ledesma-Carbayo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at CinC 2026
Abstract:Dynamic contrast-enhanced cardiac CT enables time-resolved analysis of contrast filling and washout in the left atrium (LA) and left atrial appendage (LAA), with potential applications for assessing blood stasis in atrial fibrillation (AF). Accurate segmentation across all frames is required for such analysis but is challenging due to large temporal contrast variations and the use of a single annotation per registered sequence. This creates a trade-off between training for robustness and limiting label noise. In this study, we investigate how temporal training-set design affects nnUNet-based segmentation of the LA and LAA in dynamic 4DCT. We compare training using a minimal two-frame dataset reflecting standard clinical practice, a physiologically selected subset of frames, and the full 27-frame sequence. We further evaluate the impact of foreground-based normalization. Training with all frames yielded the best performance in early low-contrast phases. However, the physiologically selected subset achieved comparable performance from the filling phase onward. Applying normalization parameters derived from the full dataset improved performance of reduced datasets in low-contrast frames, but did not fully close the gap. These findings highlight the importance of temporal diversity in training data for robust segmentation in dynamic CT, while indicating that carefully selected frame subsets may provide an effective trade-off between performance and efficiency for downstream applications.
[CV-77] No Prompt No Leaks: A Robust Generative Steganography Framework via Prompt-Free Diffusion
链接: https://arxiv.org/abs/2606.31427
作者: Jingwen Cai,Fen Xiao,Shuhua Deng,Xieping Gao
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Generative image steganography synthesizes stego images directly from secret information to achieve inherent security advantages. Latent Diffusion Models (LDMs) have recently emerged as a fundamental image steganography framework that modulates secret latent representations with text prompts. Limited by the inflexibility of text prompts, these methods still struggle to generate high-quality stego images and accurately recover secret images. In this work, we propose a prompt-free diffusion image steganography framework that integrates style semantic priors to control more robust and reliable stego image generation. Specifically, a Cascaded Affine Coupling Module (CACM) establishes a bijective, deterministic mapping between a secret image and its latent representation. Then, style semantics are integrated into the diffusion process to control latent representation and ensure visual imperceptibility in the generated stego images. To mitigate trajectory deviations stemming from the unconditioned reverse process, a predictor-corrector mechanism is introduced to iteratively refine the generation trajectory via feedback from the current and predicted next states. Extensive experimental results show that the proposed method achieves competitive performance compared to state-of-the-art methods in terms of security, secret image reconstruction accuracy and controllability.
[CV-78] mporal Preservation over Processing: Diagnosing and Designing Spatiotemporal Single-Stage Video Detectors
链接: https://arxiv.org/abs/2606.31421
作者: Karam Tomotaki-Dawoud,Anna Hilsmann,Peter Eisert,Sebastian Bosse
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Single-stage video object detectors are increasingly deployed in time-critical applications, yet it remains unclear whether these models genuinely reason over temporal context or merely exploit a single informative frame-a gap hidden by standard metrics, which reward correct predictions regardless of how they are reached. We address this from two complementary directions: first, we propose TemporalLens, a model-agnostic diagnostic framework probing temporal dependence through controlled perturbations, structured occlusions, temporal shuffling, redundancy injection, and resolution degradation, revealing whether a detector actually uses information across time. Applied to stacked-frame 2D detectors and our YOLO-3D architecture, it exposes behavioural differences invisible to mAP: stacked 2D models collapse when the target frame is removed, while spatiotemporal models recover predictions from earlier frames, a signature of real temporal reliance. Second, we detail YOLO-3D, a modular real-time spatiotemporal detector built on YOLOv8, and show that simply preserving temporal depth through the backbone is the dominant performance driver (+3.7 pp mAP@50 at 32 frames averaged across scales). Together, the diagnostics and architecture turn “does this detector reason over time?” into a measurable, actionable question.
[CV-79] Resolving superposition in AI for interpretability and cross-modal alignment in patient-neuronal images NEURIPS2026
链接: https://arxiv.org/abs/2606.31394
作者: Jisung Park,Seohyeon Kang,Daeun Yoo,Eunsu Lee,Seoin Cho,Wooyeop Choi,Ian Choi,James R. Evan,Daesoo Kim,Sonia Gandhi,Minee L. Choi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注: 10 pages, 7 figures (plus 14 in appendix), 1 table, NeurIPS 2026 preprint
Abstract:Artificial intelligence is transforming our capability to solve biological challenges. In dimensionality bottleneck regimes exacerbated by high-dimensional biological data, Neural networks force distinct concepts into the lower dimensions known as superposition. Although this superposition is widely known to hinder interpretability, its impact on corrupting the geometry of latent spaces remains critically overlooked. Here, we utilized sparse autoencoders (SAEs) trained on over 100,000 multiplexed images of patient-derived Parkinson’s disease and healthy neurons to resolve superposition. This approach bypasses the mathematical non-uniqueness of feature attribution by shifting to interpretable latent representation analysis. We theoretically and empirically demonstrate that superposition contaminates representational metric spaces, and thereby SAEs successfully recover geometric fidelity. By treating these geometrically purified representations as single-cell state vectors, we adapted single-cell RNA sequencing (scRNA-seq) data analysis methodologies directly to the image domain. Finally, we introduce GW-map, utilizing Gromov-Wasserstein optimal transport to align these image representations with authentic scRNA-seq data \emphde novo. This coupling reconstructs hierarchical neuronal pathology pathways such as Calcium-AIS scaffold, without reference spatial transcriptomics, establishing a scalable foundation for spatial biology. Code is available at this https URL
[CV-80] One Video One World: Turning Monocular Video into Physical 4D Scenes ECCV2026
链接: https://arxiv.org/abs/2606.31388
作者: Junhao Chen,Boran Zhang,Mingjin Chen,Henghaofan Zhang,Saining Zhang,Congcong Zhu,Hao Zhao,Ruqi Huang,Zhihao Li,Yufei Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026. Project Page: this https URL
Abstract:We introduce \textbfOVOW, the first training-free system that reconstructs \emphinstance-level, simulation-ready 4D mesh scenes from a single monocular video. Recent 4D reconstruction achieves impressive rendering quality, but its outputs (\eg, implicit fields, Gaussian primitives, or point clouds) lack the watertight topology, instance separation, and standardized physical interfaces required by physics simulators and embodied AI. OVOW closes this gap with a four-stage pipeline: a vision-language model discovers, labels, and motion-classifies all instances; category-aware reconstruction yields per-instance meshes for rigid objects and topology-consistent mesh sequences for deformable ones; an iterative render-match-optimize procedure recovers metric scale and 6-DoF pose trajectories; and physics-grounded assembly enforces ground contact and inter-object support. Crucially, we model all motion, rigid and non-rigid, through direct vertex deformation without category-specific priors or skeleton rigging, producing watertight mesh scenes ready for downstream physics simulation and editing. We further establish the first benchmark for \emphstructured Video-to-4D evaluation, with metrics for geometric correctness, instance separation, and physical plausibility beyond visual fidelity; the same pipeline doubles as a scalable engine for \emphsynthesizing paired video-to-4D simulation data for future 4D world models and embodied AI. Across two synthetic benchmarks (static and 4D), OVOW attains the best overall layout and geometry accuracy and the lowest photometric and semantic error among all baselines, and on monocular video runs one to two orders of magnitude faster than the baselines, while downstream physics simulation confirms its physical stability.
[CV-81] MS-Resampler: Multi-Scope Visual Resampling for Efficient Multimodal LLM s
链接: https://arxiv.org/abs/2606.31383
作者: Zhongyang Li,Yaqian Li,Faming Fang,Rinyoichi Takezoe,Zi-Hao Bo,Cheng Qian,Mo Guang,Guixu Zhang,Kaiwen Long
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal large language models (MLLMs) typically employ resampling-based projectors to transform dense visual features into a compact token sequence for language modeling. Most existing resamplers adopt a single, fixed aggregation scope via global cross-attention, which can blur fine-grained local evidence and limit the ability to capture both local details and global context within a fixed token budget. In this work, we propose MS-Resampler, a multi-scope visual resampling framework for MLLMs. MS-Resampler instantiates multiple scope-specific resamplers by injecting explicit spatial scope priors into the resampling attention, enabling each branch to aggregate visual information at a particular granularity from local to global. The outputs of these scope-specific resamplers are then adaptively fused to produce the final visual representations for language modeling. Extensive experiments on ten public multimodal benchmarks show that MS-Resampler consistently improves visual understanding and multimodal reasoning over conventional single-scope resamplers, while introducing only minimal computational overhead.
[CV-82] MAPE: Defending Against Transferable Adversarial Attacks Using Multi-Source Adversarial Perturbations Elimination
链接: https://arxiv.org/abs/2606.31378
作者: Xinlei Liu,Jichao Xie,Tao Hu,Peng Yi,Yuxiang Hu,Shumin Huo,Zhen Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages
Abstract:Neural networks are vulnerable to meticulously crafted adversarial examples, leading to high-confidence misclassifications in image classification tasks. Due to their consistency with regular input patterns and the absence of reliance on the target model and its output information, transferable adversarial attacks exhibit a notably high stealthiness and detection difficulty, making them a significant focus of defense. In this work, we propose a deep learning defense known as multi-source adversarial perturbations elimination (MAPE) to counter diverse transferable attacks. MAPE comprises the single-source adversarial perturbation elimination (SAPE) mechanism and the pre-trained models probabilistic scheduling algorithm (PPSA). SAPE utilizes a thoughtfully designed channel-attention U-Net as the defense model and employs adversarial examples generated by a pre-trained model (e.g., ResNet) for its training, thereby enabling the elimination of known adversarial perturbations. PPSA introduces model difference quantification and negative momentum to strategically schedule multiple pre-trained models, thereby maximizing the differences among adversarial examples during the defense model’s training and enhancing its robustness in eliminating adversarial perturbations. MAPE effectively eliminates adversarial perturbations in various adversarial examples, providing a robust defense against attacks from different substitute models. In a black-box attack scenario utilizing ResNet-34 as the target model, our approach achieves average defense rates of over 95.1% on CIFAR-10 and over 71.5% on Mini-ImageNet, demonstrating state-of-the-art performance.
[CV-83] Domain Adaptive Object Detection via Dual-Stream Bilevel-Cycle Optimization
链接: https://arxiv.org/abs/2606.31373
作者: Yannan Chen,Wenqiang Wang,Ruoyu Chen,Jiancheng Wang,Mingbo Yang,Yaowei Wang,Wei Wang,Xiaochun Cao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Cycle self-training (CST) breaks the shared classifier assumption of the standard self-training framework, which is effective for unsupervised domain adaptation and exploits unlabeled target data by training with target pseudo-labels. CST introduces a target classifier and employs an inner-outer loop updating strategy, addressing the issue of unreliable pseudo-labels and enabling pseudo-labels to generalize across domains. Despite its success in image classification, extending CST to object detection faces three main challenges. First, the upper bound of CST in object detection is constrained by three types of unreliable pseudo-labels, such as classification error alone, localization error alone, and their combination. Second, since object detection involves detecting multiple target objects, directly applying CST leads to training insta bility. Third, a wider numerical range of regression coordinates leads to exploding losses. To this end, we apply CST to both classification and regression and propose the Dual-Stream Bilevel-Cycle Optimization framework. Specifically, we construct CST upon Mean Teacher to prevent training instability and use extra normalization to map the regression bounding box into a standardized space, effectively addressing exploding losses. Also, we provide a theoretical derivation of the regression bound. Extensive experiments across four cross domain standard scenarios demonstrate that our framework achieves considerable results.
[CV-84] Evidence Triangulation for Multimodal Fact-Checking in the Wild
链接: https://arxiv.org/abs/2606.31367
作者: Stefanos-Iordanis Papadopoulos,Zacharias Chrysidis,Christos Koutlis,Symeon Papadopoulos,Panagiotis C. Petrantonakis
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The proliferation of multimedia content on social platforms has fueled multimodal misinformation, where images are used to reinforce false claims. Consequently, Multimodal Fact-Checking (MFC) has emerged as an increasingly important research area. However, current progress is hindered by a reliance on synthetic training data and curated benchmarks that fail to capture the complexity of in-the-wild data. Furthermore, existing detection models rely on restricted intra-modality consistency or unconstrained all-to-all fusion, failing to capture nuanced relations between posts and external evidence. To address these limitations, we introduce X-POSE, a benchmark of real-world, community-annotated multimodal posts from X (formerly Twitter), augmented with full-length news articles retrieved via VLM-optimized search. Additionally, we propose TRENT, a novel MFC model that performs evidence triangulation using three parallel cross-attention streams alongside a relational fusion mechanism that explicitly models entailment and contradiction. Extensive evaluations demonstrate that TRENT consistently outperforms state-of-the-art specialized models and commercial VLMs. The code, prompt templates, and dataset are available at this https URL
[CV-85] Language-Assisted Super-Resolution from Real-World Low-Resolution Patches
链接: https://arxiv.org/abs/2606.31363
作者: Joonkyu Park,Kyoung Mu Lee
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages
Abstract:Single image super-resolution aims to reconstruct high-resolution (HR) images from low-resolution (LR) inputs. Training SR models typically requires paired HR-LR data, which is difficult to obtain in reality. As a result, most methods synthesize LR images by artificially degrading HR images with handcrafted kernels or camera ISP adjustments. However, these synthetic degradations fail to capture the complexity of real LR images, leading to poor generalization in practice. To address this, we observe that even within a single high-quality image, regions at different depths exhibit varying resolutions, where distant regions act as LR patches and closer ones as HR patches. This allows the extraction of real, degradation-induced LR patches from real images. Since these LR patches lack paired HR counterparts, we propose LA-SR (Language Assistant for SR), a novel framework for unpaired SR. The key idea of LA-SR is to redefine unpaired SR in the language space, using vision-language models to bridge the LR-HR gap. LA-SR projects images into a semantically rich space representing both content and quality, and applies two language-guided losses: linguistic content loss to preserve semantic fidelity, and linguistic quality loss to enhance perceptual realism. With this alignment, LA-SR effectively super-resolves real LR inputs, producing realistic outputs that overcome the limitations of synthetic-data-trained methods.
[CV-86] RCL-Mamba: A Dual-domain State Space Model for Measurement-oriented Image Restoration in Rotational Sparse-View Scanning Computed Laminography
链接: https://arxiv.org/abs/2606.31353
作者: Xuyang Duan,Genyuan Zhang,Zhenjiang Dong,Chuandong Tan,Zihao Wang,Junyao Wang,Fenglin Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rotational Scanning Computed Laminography (RCL) is widely utilized for the Non-Destructive Testing(NDT) of large planar components. However, to facilitate rapid inspection, continuous sparse-view scanning is often employed, where the angular integration effect during exposure induces rotational blur in the projection domain. Furthermore, the data incompleteness inherent in sparse sampling manifests as sparse artifacts in the reconstructed image domain. To address these cross-domain degradations, this paper proposes RCL-Mamba, a measurement-oriented dual-domain State Space Model (SSM)-based image restoration network. The framework adopts a cascaded joint processing strategy: it first corrects the rotational blur in the projection domain and subsequently suppresses the sparse artifacts in the image domain. Additionally, we design a Mamba-CNN dual-branch module to adaptively balance large-scale blur correction with local detail recovery. Evaluations on both simulated datasets and real-world Printed Circuit Board (PCB) scans demonstrate that RCL-Mamba outperforms existing baselines in blur removal, artifact suppression, and structural preservation. Line-profile-based structural measurement further verifies that the proposed method better preserves via/pad boundaries and slender trace profiles. Crucially, by reducing the required scanning views from 512 to 64, our method enhances inspection efficiency by approximately 8-fold without compromising reconstruction quality, offering a robust measurement-oriented restoration solution for high-throughput RCL inspection with improved structural measurement fidelity.
[CV-87] Patient-Level Elbow Abnormality Detection: Leakage-Aware Evaluation of Learned Preprocessing Calibration and Triage-Oriented Operating Points
链接: https://arxiv.org/abs/2606.31348
作者: Ahmed Sallam,Ahmet Kaplan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Conference paper
Abstract:In this study, we examine learned preprocessing pipelines in the context of triage-oriented orthopedic abnormality detection task using elbow radiographs from MURA dataset. The evaluation focuses on patient-level detection of musculoskeletal abnormalities under a leakage-aware protocol. We compare multiple preprocessing pipelines, with and without a lightweight DnCNN module as a learned preprocessing component, to assess their impact on discrimination and calibration. Performance is assessed using discrimination metrics (AUROC, PR-AUC), calibration measures (ECE, Brier score), and validation-selected operating point analysis targeting high specificity. Results show that differences across preprocessing strategies are modest and configuration-dependent, with no consistent discrimination advantage over the raw-input DenseNet121 baseline. The raw and diverse inputs combined with the DnCNN front-end showed reduced ECE and Brier score, while CLAHE combined with DnCNN did not improve calibration. Overall, the results suggest that under patient-level evaluation, preprocessing gains are modest and configuration-dependent; the raw-input DenseNet121 baseline remains competitive throughout, and no tested preprocessing strategy produced a consistent discrimination advantage across all metrics.
[CV-88] Bridging Video Understanding and Generation in a Unified Framework
链接: https://arxiv.org/abs/2606.31326
作者: Yuqi Wang,Runyi Li,Ruoyu Feng,Renjie Chen,Wenfeng Lin,Mingyu Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: technical blog
Abstract:Recently, unified image generation and understanding have been extensively explored. However, extending such unified modeling paradigms to the video domain remains largely underexplored. A central challenge is that video understanding favors compact, discriminative semantic representations, whereas video generation requires dense signals that preserve visual details and temporal coherence. Videos naturally capture both spatial semantics and temporal dynamics, making them a more suitable modality for unified multimodal modeling compared to static images. In this paper, we propose Vega, a unified framework that bridges video understanding and generation. Vega leverages a shared vocabulary to jointly model text and visual representations and employs a hybrid architecture combining autoregressive (AR) prediction with diffusion-based rendering. Specifically, the AR model focuses on predicting semantically meaningful visual tokens for keyframes, providing a structured representation that guides the diffusion module in rendering dense, high-resolution video frames. Extensive experiments demonstrate that Vega achieves strong performance on video generation benchmarks such as VBench and video understanding benchmarks like VideoMME.
[CV-89] Accelerated Likelihood Maximization for Diffusion-based Versatile Content Generation ECCV2026
链接: https://arxiv.org/abs/2606.31323
作者: Hyunsoo Lee,Inwoo Hwang,Young Min Kim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026. Project website: this http URL
Abstract:Generating diverse, coherent, and plausible content from partially given inputs remains a fundamental challenge for diffusion models. Existing approaches face clear limitations: training-based approaches offer strong task-specific results but require costly computation, and they generalize poorly across tasks. Training-free approaches offer better efficiency, but they do not explicitly optimize over unobserved variables, leading to globally inconsistent results. To address these limitations, we introduce Accelerated Likelihood Maximization (ALM), a novel training-free sampling strategy integrated into the reverse diffusion process that significantly extends the applicability of diffusion models beyond simple generation tasks. Unlike previous methods that implicitly influence missing regions through pre-generated region constraints, we directly optimize the unobserved region during the sampling process, enabling globally coherent and plausible generation. Furthermore, we incorporate an acceleration strategy that significantly improves computational efficiency without sacrificing performance. Experimental results demonstrate that ALM consistently outperforms state-of-the-art methods in various data domains and tasks, establishing a powerful paradigm for versatile content generation.
[CV-90] Wavelet-Optimized Pseudo-3D Accelerated Diffusion Model for Truncated Computed Laminography
链接: https://arxiv.org/abs/2606.31318
作者: Genyuan Zhang,Junyao Wang,Chuandong Tan,Fenglin Liu,Yongning Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 11 figures, 4 tables. Under review at NDTE International
Abstract:Computed Laminography (CL) is a key technology for the nondestructive testing of large plate-shaped objects. However, field-of-view (FOV) limitations inevitably lead to truncation of projected data, an ill-posed inverse problem that causes severe reconstruction artifacts. Existing deep learning methods typically rely on 2D architectures that lack rigorous data consistency constraints. Furthermore, they conventionally confine artifact removal strictly to the FOV, discarding potentially recoverable information outside it. To overcome these limitations, we first introduce a comprehensive CL FOV analysis, categorizing the space into data-complete, data-incomplete, and data-free regions. By extending our reconstruction target to encompass the data-incomplete region, we significantly expand the effective imaging range and enhance scanning efficiency. To achieve this, we propose a novel wavelet-optimized pseudo-3D accelerated diffusion model for CL truncation reconstruction (CL-DM). Our method utilizes a standard 2D diffusion model for slice aggregation, combined with a 3D model-based iterative reconstruction (MBIR) method to ensure strict data consistency. To mitigate inter-slice discontinuities, we introduce wavelet regularization along the z-direction, paired with a translation-invariant (TI) mechanism and a low-frequency preservation strategy. Finally, we introduce a 3D fast sampling architecture, significantly accelerating inference speed. Extensive simulations and real-world experiments demonstrate that CL-DM is superior in effectively eliminating truncation artifacts and restoring high-fidelity, continuous 3D structures.
[CV-91] Deep Spectral Models for Robust Dental Shape Generation
链接: https://arxiv.org/abs/2606.31293
作者: Tibor Kubík,François Guibault,Michal Španěl,Hervé Lombaert
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) this https URL
Abstract:Accurate modeling of dental crown morphology is fundamental for diagnosis, orthodontic planning, and computer-aided restoration design. However, datasets suitable for training such models are typically limited in size. We present ToothForge, a deep spectral generative framework that models dental crown geometries from compact, intrinsic representations. By operating in the spectral domain, ToothForge learns a latent manifold of 3D tooth shapes through synchronized spectral embeddings, ensuring consistent modeling across samples with varying connectivity. Spectral synchronization mitigates the instability of Laplace-Beltrami eigenbases and enables efficient learning in a low-dimensional space. The framework is thoroughly evaluated through robustness analysis, ablation studies, and benchmarking against PCA-based statistical shape models and point-based generative frameworks. Results show that synchronized spectral modeling achieves reconstruction and generative performance comparable to or exceeding spatial approaches, while maintaining compactness and geometric interpretability. Together, the compact synchronized coefficients and low-dimensional learning space make the framework particularly suitable for limited datasets, as often encountered in dental and medical domains, and applicable in real-world scenarios where guaranteeing consistent connectivity across shapes from various clinics is unrealistic.
[CV-92] Editing Everything Everywhere All at Once ECCV2026
链接: https://arxiv.org/abs/2606.31278
作者: Fabio Quattrini,Carmine Zaccagnino,Enis Simsar,Marta Tintoré Gazulla,Rita Cucchiara,Alessio Tonioni,Silvia Cascianelli
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026
Abstract:Editing multiple elements of an image in a single forward pass is a practical alternative to multi-turn image manipulation, offering improved efficiency and potentially better harmonization. However, when several instructions target different regions, semantic interference often leads to attribute leakage and poor edit disentanglement, especially as the number of edits increases. In this work, we propose MICE (Multi-Instance Concurrent Editing), a training-free strategy for scalable multi-instance image editing with Multimodal Diffusion Transformers. MICE modifies the additive bias of joint attention to regulate interactions between instance-specific edit instructions, latent, and context tokens identified via user-provided segmentation masks. Specifically, MICE allows intra-instance attention, penalizes interactions between neighboring region tokens, and suppresses unrelated cross-instance attention. As a result, our method enforces attribute binding while preserving global visual consistency. We evaluate MICE on LoMOE-Bench and introduce MICE-Bench, a more challenging benchmark with an average of 8.5 concurrent edits per image. The experiments demonstrate that our approach outperforms strong baselines and recent competitors in terms of visual quality preservation and faithfulness to the editing instructions.
[CV-93] CLIMB: Centroid-Based Hierarchical Memory for Online Continual Self-Supervised Learning
链接: https://arxiv.org/abs/2606.31275
作者: Julien Lefebvre,Stefan Duffner,Mathieu Lefort
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at CoLLAs 2026 conference
Abstract:Online Continual Self-Supervised Learning (OCSSL) aims to learn representations from a continuous stream of unlabeled data, without knowledge of task boundaries and under memory constraints. Existing methods rely either on replay buffers that exploit latent space structure, or on regularization alone. We present CLIMB (Continual Learning with Intelligent Memory Bank), which combines both simultaneously. Our method introduces a hierarchical centroid-based memory, bounded in total number of stored images, combined with knowledge distillation on replayed examples to limit representation drift. The memory groups similar images into centroids, providing hard-to-discriminate examples for contrastive learning while covering the diversity of observed distributions. Experiments on Split CIFAR-100 and Split ImageNet-100, on standard benchmarks from the state-of-the-art as well as a new protocol with irregular task distributions show that CLIMB outperforms state-of-the-art OCSSL methods.
[CV-94] WarpHammer: Densifying Scene Warps with 3D Object Priors for Extreme View Synthesis
链接: https://arxiv.org/abs/2606.31258
作者: Michael Green,Gavriel Habib,Dvir Samuel,Tal Berkovitz Shalev,Issar Tzachor,Rami Ben-Ari,Or Litany
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Projection-conditioned novel view synthesis (NVS) warps an explicit 3D reconstruction of the input view into the target camera and conditions a generator on the warped rendering. This works well for small viewpoint changes but degrades sharply under large orbital motion: the warp becomes sparse around the orbited object, where hidden surfaces dominate the new view and mirror-like artifacts emerge, causing the generator to lose both pixel content and the implicit camera cue carried by the warp. We introduce WarpHammer, a training-free framework that resolves this failure mode by augmenting the warped scene with an explicit 3D reconstruction of the object obtained from a native 3D generative prior (e.g., SAM3D). The reconstructed object adds missing foreground surfaces and occludes background points that should no longer be visible, restoring both appearance and camera cues without fine-tuning the base model. The same explicit object representation further unlocks a capability current NVS pipelines do not support: incorporating auxiliary views of the object from sources outside the target scene, for example, a casual snapshot of a car paired with a manufacturer studio shot of the same model. We process the reference and auxiliary images jointly with a pretrained multi-view geometry foundation model, which predicts a unified point cloud that we fuse into the 3D object reconstruction. This yields substantially more faithful geometry than single-image reconstruction, without requiring user-provided camera poses for the auxiliary views. On five benchmarks, WarpHammer produces stable novel views at viewpoint deviations where strong baselines collapse, and is the first scene-level NVS method that can naturally fuse auxiliary, pose-unknown object views from an external source.
[CV-95] Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning
链接: https://arxiv.org/abs/2606.31257
作者: Chih-Ting Liao,Fei Shen,Xin Cao,Tat-Seng Chua
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The standard way to read latent knowledge out of a model, a linear probe confirmed by a steering recovery, can systematically overstate what a vision-language model (VLM) actually grounds in the image. We show this on spatial reasoning, where the error is invisible to both probing and steering yet exposed by a one-line causal control: replacing the image with a gray blank. Probes decode the within-axis answer at 73–97% across axes, and a training-free projection lifts a near-chance axis from 59% to 79%, exactly the signature of unlocking latent knowledge. The blank-image arbiter refutes it, revealing three grounding regimes that probing conflates: an axis can be grounded (vision-dependent, correct), a prior (vision-independent, with its decode and its apparent recovery a directional default rather than perception), or, surprisingly, inverted: decodable, causally controllable, but deployed with the wrong sign, so the model scores below chance and the error requires looking. The taxonomy holds across the studied VLMs: in fourteen models spanning six language-model families and 2B–27B, horizontal is grounded, vertical is a prior, and depth is inverted, with the inversion emerging at scale within families. The decode-versus-deploy inversion replicates on seven of eight models across five families, and the minimal edit that re-deploys it varies with geometry: a training-free rotation matches a trained edit on the cleanest model, while distributed inversions need a trained low-rank edit, tracing a per-model correction-complexity spectrum. The cheap, self-calibrating arbiter cleanly separates grounded perception, inverted perception, and prior substitution; we argue it should be a default control for latent-knowledge and steering claims in VLMs.
[CV-96] Rethinking the Role of Feature Engineering and Learning Strategies in Few-Shot Hidden Emotion Recognition
链接: https://arxiv.org/abs/2606.31249
作者: Xiaochuan Guo,Jihao Gu,Haixu Liu,Yuxin Liu,Qi Wang,Yufei Wang,Fei Wang,Kun Li,Dan Guo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this paper, we present the solution developed by our team, XInsight Lab, which achieved first place in Track 3 of the 4th EI-MIGA-IJCAI Challenge with a test accuracy of 0.76923. To address the challenge of weak and sparse implicit emotion evidence in long videos, this paper extends the winning solution from the previous competition and proposes a compact multi-modal temporal modeling framework. The framework integrates and evaluates the effects of multi-source features, including 2D/3D skeletons, facial expression Blendshapes, DINOv2/v3 vision foundation models, X-CLIP video features, and Gemini semantic priors. Architecturally, we propose a cross-attention mechanism that utilizes static pose features, denoted as Base, as the Query and dynamic micro-motion differential features, denoted as Offset, as the Key and Value. By capturing local relative velocities, this mechanism eliminates static biases related to individual body shape and identity. Concurrently, an adaptive pooling method based on Multiple Instance Learning is employed to extract instantaneous emotions while suppressing background noise in long sequences. Finally, the paper reveals the representation collapse phenomenon of general vision foundation models in micro-dynamic tasks, and analyzes the underlying mechanisms where networks fall into public-leaderboard-driven pseudo-generalization due to shortcut learning and rote memorization.
[CV-97] HyperVLP: Enhancing Hierarchical Surgical Video-Language Pre-training in Hyperbolic Space
链接: https://arxiv.org/abs/2606.31245
作者: Yaojun Hu,Kun Yuan,Nassir Navab,Haochao Ying,Jian Wu,Nicolas Padoy
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Surgical vision-language foundation models typically adopt educational materials, such as surgical lecture videos, to transfer surgical knowledge encoded in language into visual representations. These knowledge are multi-dimensional and hierarchical: fine-grained action cues appear in narration, mid-level key steps are summarized in subsection headings, and global procedural context, such as patient history and surgical strategy, is described in abstract texts. Prior work largely collapses these heterogeneous signals into a single flat embedding space, implicitly assuming independence across hierarchy levels. However, this is suboptimal because it ignores cross-level semantic containment, e.g., actions belong to steps, steps compose phases, weakens long-range dependency modeling. To this end, we propose a hyperbolic surgical video-language pre-training framework that explicitly preserves the hierarchical structure by mitigating structural false negatives induced by procedural context and enforcing semantic consistency between parent phases and their constituent child steps. Extensive experiments on multiple surgical benchmarks show consistent gains in zero- and few-shot phase recognition across procedures and institutions.
[CV-98] UHD-MFF: Shattering Barriers in Multi-Focus Ultra-High-Definition Image Fusion via Learnable Lookup Tables ECCV2026
链接: https://arxiv.org/abs/2606.31242
作者: Yibing Zhang,Xunpeng Yi,Qinglong Yan,Yeda Wang,Han Xu,Jiayi Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026
Abstract:With the advancement of imaging technology, ultra-high-definition images have become increasingly essential in modern visual applications. However, existing multi-focus image fusion remains largely confined to low-resolution images and faces three major barriers in UHD scenarios, namely data availability, model adaptability, and deployment feasibility, which severely hinder its practical application. To shatter these barriers, first, we propose the UHD-MFF dataset, the first large-scale ultra-high-resolution multi-focus fusion dataset. Second, we propose a scale-specialized lookup-table framework tailored for ultra-high-resolution images, termed as UMF-LUT. It consists of Coarse-Region Lookup Table (C-LUT) and Detail-Edge Lookup Table (D-LUT). Specifically, C-LUT performs joint queries of multiple gradient cues and semantic cues at low-resolution scales to enable region-level decision-making. Also, D-LUT operates at high-resolution scales, leveraging efficient Laplacian cues to provide complementary edge-level decision information. Such a design makes the model particularly well-suited for ultra-high-resolution multi-focus image fusion. Finally, it offers strong deployability with minimal computational overhead, enabling real-time 4K multi-focus fusion and showing promising potential for smartphone. Extensive experiments demonstrate that it outperforms SOTA methods in both visual fidelity and quantitative metrics. It effectively advances the development of multi-focus image fusion toward ultra-high-resolution imaging scenarios. The code is available at this https URL.
[CV-99] ForgeDrive: Bidirectional Cross-Conditioning for Unified Visual-Action Generation in Autonomous Driving
链接: https://arxiv.org/abs/2606.31226
作者: Xuchang Zhong,He Zheng,Chenxu Zhao,Tianxiong Lv,Hangqi Fan,Bohua Wang,Yushan Liu,Zhihao Liao,Leigang Luo,Congyang Zhao,Yang Cai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:World-model-based autonomous driving endows the model with the ability to understand scene evolution. Yet this promise is undermined by the prevailing imagine-then-act paradigm, which allows errors from the more challenging visual generation stage to cascade into action planning. We introduce ForgeDrive, a unified autoregressive diffusion framework with visual-action cross-conditioning that closes this gap through act-then-imagine paradigm. ForgeDrive factorizes the future as a sequence of per-timestep frame-action pairs, intertwining each action with its corresponding visual observation. During training, we decouple the diffusion timesteps of the two modalities and introduce a UniDiffuser-style noise scheduler to get the ability to infer either modality from its counterpart and deepen understanding of relationships between images and actions. At inference, we propose a novel act-then-imagine inference paradigm, and find that at each step, action generation is a capability internalized during training, requiring no clean future frame as a prerequisite at inference time; instead, the generated action can improve the accuracy of future frame generation, which in turn enhances the quality of the next action. Additionally, we augment each step with future ego-status prediction, further sharpening planning ability. Extensive experiments on NAVSIM demonstrate that ForgeDrive not only unifies driving simulation, planning, and visual odometry into a single model, but also outperforms existing strong planners without any post-training strategy.
[CV-100] CooperScene: Multi-Modal Cooperative Autonomy Benchmark with C-V2X Communication Characterization ECCV2026
链接: https://arxiv.org/abs/2606.31219
作者: Bo Wu,Ruoshen Mo,Justin Yue,Yanyu Zhang,Janice Nguyen,Guoyuan Wu,Amit Roy-Chowdhury,Matthew J. Barth,Hang Qiu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. 15 pages, 15 figures
Abstract:Cellular vehicle-to-everything (C-V2X) enables cooperative perception, prediction, and planning beyond the field of view of individual agents. However, existing datasets often overlook the complexities of real-world deployment, such as limited communication bandwidth and its dynamics, heterogeneous sensing modalities, and scalability beyond a single cooperative partner. In this paper, we introduce CooperScene, a high-fidelity cooperative autonomy dataset with real-world C-V2X communication characterization. The dataset is organized into diverse scenes, including intersections, highway ramps, and parking lots. These scenes involve three connected and autonomous vehicles (CAVs) and one infrastructure roadside unit (RSU), all equipped with multi-modal sensors and commercial off-the-shelf C-V2X communication radios. All scenes are annotated with globally consistent 3D labels at 10 Hz, totaling 344K objects across 59K frames, underpinned by tight sensor- and agent-synchronization, centimeter-level localization and spatial alignment, precise cross-modality calibration, and 3GPP-standard-compliant C-V2X communication. CooperScene establishes a rigorous benchmark for evaluating multi-agent scaling and actual performance in real-world deployable settings. Project website for data and benchmark: this https URL
[CV-101] AC3S: Adaptive Conditioning for 3D-Aware Synthetic Data Generation ECCV2026
链接: https://arxiv.org/abs/2606.31204
作者: Eric Ji,Qiran Hu,Wufei Ma,Sarthak Jain,Yingying Li,Minh N. Do,Yaoyao Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026. Project page: this https URL
Abstract:Synthetic data generation has emerged as a powerful tool for improving data scalability in computer vision. Recent diffusion-based pipelines have demonstrated strong photorealism. However, how to enforce precise 3D structure and pose consistency in generated images remains challenging. Existing methods leverage visual prompts such as edge maps to guide diffusion models, but often suffer from over-conditioning artifacts that degrade image realism and limit dataset quality. In this paper, we present a diffusion-based image generation framework that enforces 3D structural alignment while preserving photorealism through adaptive conditioning. Our framework, Adaptive Conditioning for 3D-Aware Synthetic Data Generation (AC3S), introduces a self-supervised visual prompt modulator that dynamically adjusts the strength of ControlNet conditioning, preventing over-conditioning and enabling the diffusion model to retain its generative expressiveness. To further enhance diversity and semantic consistency, we develop a multi-agent vision language model framework that composes detailed and 3D-aware prompts aligned with the underlying geometric structure. Together, these components enable the scalable generation of high-quality synthetic datasets with accurate 2D and 3D annotations. Extensive experiments demonstrate that our method significantly improves image quality and downstream utility.
[CV-102] ExPLoRe: Expert Patch-Level Loss Routing for Multi-Objective Masked Image Modeling ECCV2026
链接: https://arxiv.org/abs/2606.31201
作者: Konstantinos Georgiou,Maofeng Tang,Hairong Qi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. Main paper 15 pages, 3 figures; supplementary material included as appendix
Abstract:Multi-objective masked image modeling (MIM) combines complementary learning signals (token distillation, CLS alignment, and pixel reconstruction) but existing methods weight these objectives with global scalars, ignoring spatial heterogeneity across patches. We present ExPLoRe (Expert Patch-Level Loss Routing), which repurposes Soft Mixture of Experts (MoE) dispatch weights as learned, per-patch loss coefficients. The key mechanism is loss-coupling: allowing loss gradients to flow through dispatch weights to the router enables content-dependent specialization, where different patches receive different emphases across objectives. A detach ablation confirms loss-coupling as the core mechanism, degrading performance by 1.6% when gradients are blocked. On ImageNet-1K with ViT-Base, ExPLoRe improves over non-MoE baselines on two objective combinations (Token+CLS: +0.5% k-NN, +4.4% linear probe; Token+Pixel: +2.2% k-NN), achieving 80.6% linear probe and 85.3% finetuning accuracy, competitive with published methods. For downstream transfer, we develop adaptation recipes (Freeze Routing, Expert Dropout, and Freeze Attention) that improve MoE finetuning by +1.5% over the vanilla MoE, and close a 2.5–2.9 mIoU segmentation gap so that MoE models match or exceed non-MoE baselines on ADE20K.
[CV-103] Distilling Temporal Coherence into 2D Networks for Transrectal Ultrasound Prostate Video Segmentation MICCAI2026
链接: https://arxiv.org/abs/2606.31198
作者: Dong Yeong Kim,JunGyu Lee,Jaewon Choi,June Young Seo,Myeongseop Kim,Jinwook Choi,Taek Min Kim,Young-Gon Kim
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 29th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2026)
Abstract:Real-time video segmentation of the prostate in Transrectal Ultrasound (TRUS) is essential for image-guided interventions. While conventional 2D methods suffer from inter-frame inconsistencies by disregarding temporal context, 3D architectures incur prohibitive latency. To resolve this dilemma, we present a Temporally Consistent Learning Framework that distills temporal coherence into a 2D network during training, preserving single-frame inference efficiency. Our design is driven by a key clinical observation: the prostate exhibits geometric stability, whereas the surrounding acoustic environment fluctuates due to physiological motion and transducer pressure. Because conventional temporal constraints propagate erroneous gradients from these unstable regions, we introduce a Confidence-Weighted Temporal Consistency objective derived from optical flow warping residuals, selectively attenuating contributions from unreliable regions. Complementing this pixel-wise constraint, a Dual-scale Prototype Alignment Module enforces semantic coherence through contrastive optimization of local boundary and global semantic features. Furthermore, to eliminate the need for dense per-frame video annotations, we employ geometric equivariance-based pseudo-labeling with knowledge distillation from a pretrained teacher. Extensive experiments on SUN-SEG and our newly introduced TRUS-V benchmark (2,679 frames) demonstrate state-of-the-art accuracy and temporal consistency at real-time speed. Code and dataset are available at this https URL.
[CV-104] Learning to Deny: Action Denial in Multimodal Large Language Models ECCV2026
链接: https://arxiv.org/abs/2606.31187
作者: Raiyaan Abdullah,Shehreen Azad,Yogesh Singh Rawat
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026 main conference
Abstract:Multimodal large language models (MLLMs) have rapidly advanced video understanding, achieving strong zero-shot and few-shot recognition across standard benchmarks. Yet their ability to deny an action by recognizing when an activity is not happening despite strong contextual cues remains largely unexplored. We introduce UCF101-AD, a large-scale benchmark consisting of paired Action-Presence and Action-Denial clips, designed to evaluate this capacity for denial. Each negative video in UCF101-AD preserves the same contextual and motion cues, including persons, objects, and locations, as its positive counterpart, but the defining action itself is explicitly absent. Evaluating 20 state-of-the-art MLLMs reveals a consistent failure: models that exceed 85% accuracy on the positive action classes collapse below 50% on their action-denial counterparts, indicating a strong inclination to affirm plausible actions rather than verify that they truly occur. This exposes a critical blind spot in modern video understanding: the inability to reason causally about whether a motion actually happens. To probe this issue, we explore a causal graph formulation, CausalAct, which expresses scene structure through natural-language prompts linking context, interaction, and motion. Incorporating such causal cues substantially reduces false positives, demonstrating that denial is a learnable reasoning skill. UCF101-AD provides a new lens for diagnosing and improving causal reasoning in multimodal models. Dataset and relevant code: this https URL.
[CV-105] GaussianMap: Learning Gaussian Representation for Multi-Sensor Online HD Map Construction
链接: https://arxiv.org/abs/2606.31177
作者: Hongyu Lyu,Julie Stephany Berrio Perez,Mao Shan,Stewart Worrall
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous driving systems benefit from high-definition (HD) maps that provide critical information about road infrastructure. The online construction of HD maps offers a scalable approach to generate local vectorized maps from onboard sensor observations. Existing methods commonly adopt bird’s-eye-view (BEV) features as the intermediate scene representation, encoding the surrounding space with fixed-resolution dense grids. However, map elements are spatially sparse yet require fine-grained geometric localization, making uniformly allocated BEV representations redundant and less effective for vectorized map prediction. In this work, we propose GaussianMap, an online HD map construction framework that learns an adaptive Gaussian representation of the surrounding scene. This representation consists of a set of Gaussian primitives on the BEV plane, each encoding a flexible local region with geometric properties and a feature vector, allowing the model to allocate representational capacity to map-relevant regions. To generate such a representation from sensor observations, we introduce a feed-forward Gaussian encoder that progressively refines these primitives through Gaussian interaction modeling and multi-sensor feature aggregation. The refined Gaussian representation is then splatted into a BEV feature map and decoded into vectorized map predictions. Extensive experiments on nuScenes and Argoverse 2 datasets demonstrate that GaussianMap achieves state-of-the-art performance in both camera-only and camera-LiDAR fusion settings. Our code will be made publicly available.
[CV-106] HSDF-Lane: Height-Aligned Signed Distance Field with Semantic Lane Prior for 3D Lane Detection ECCV2026
链接: https://arxiv.org/abs/2606.31172
作者: Jiyong Boo,Byeongin Joung,Hyemin Yang,Kuk-Jin Yoon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026, Project page: this https URL
Abstract:Monocular 3D lane detection plays a critical role in autonomous driving, yet recovering reliable 3D geometry from a single image remains challenging due to inherent depth ambiguity. Prior methods project image features into Bird’s-Eye-View (BEV) space under a flat-ground assumption, causing geometric distortion on real-world roads. Recent methods instead predict explicit height maps to capture non-planar surfaces, but still rely on sparse anchor-based regression and exploit the recovered geometry merely for spatial transformation rather than semantic understanding. To overcome these limitations, we propose HSDF-Lane, which implicitly models the road surface as a Height-aligned Signed Distance Field (HSDF) over a densely sampled 3D feature volume. Through differentiable rendering, the HSDF jointly produces an accurate height map and surface-aligned features. We further introduce Lane-aware Semantic Positional Encoding (LSPE), which injects a lane-existence prior derived from the surface-aligned features into the transformer queries, coupling geometric structure with semantic guidance. Extensive experiments on the OpenLane benchmark show that HSDF-Lane achieves state-of-the-art performance in both 3D lane detection and height map estimation.
[CV-107] Beyond Single Character: Evaluating MLLM s for Sentence-Level Oracle Bone Inscription Understanding
链接: https://arxiv.org/abs/2606.31169
作者: Ziqi Li,Zijian Chen,Tingzhu Chen,Guangtao Zhai
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures
Abstract:Existing AI-assisted oracle bone inscription (OBI) visual recognition and understanding studies mainly focus on character-level, ignoring the long-form textual coherence and contextual dependencies embedded in complete divination charges. Recently, the powerful visual perception capabilities of multimodal large language models (MLLMs) have opened new possibilities for OBI information processing. In this work, we introduce S-OBI, a novel benchmark for evaluating MLLMs in Sentence-level OBI understanding. Instead of using noisy and incomplete rubbings as the visual input, S-OBI synthesizes clear and standardized sentence-level OBI instances through glyph substitution and composition. According to 95 original rubbings with translations that have been identified, corrected, and verified by experts, we replace characters in the original rubbings with corresponding clean glyph samples sourced from existing OBI datasets while preserving the overall inscriptional structure and semantic organization. This mitigates the influence of low-level distortions and enables a more focused evaluation of sentence-level OBI understanding. Based on this, we design semantic matching, semantic slot extraction, and contextual reasoning tasks and obtain 695 question-answer pairs. Experiments reveal the inferiority of contemporary MLLMs on sentence-level OBI understanding. In particular, visual perception errors in unmasked regions propagate through the reasoning chain, leading to erroneous predictions for masked characters, which indicates that sentence-level OBI understanding in current models remains strongly dependent on character-level recognition. Overall, S-OBI provides a diagnostic benchmark for evaluating whether MLLMs can move beyond isolated character recognition toward structured inscription-level understanding.
[CV-108] Seeing Through the Weights: Privacy Leakage in Scene Coordinate Regression
链接: https://arxiv.org/abs/2606.31164
作者: Oleksii Nasypanyi,Jaemin Cho,Utku Ozbulak,Byungkon Kang,Francois Rameau
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Scene Coordinate Regression (SCR) methods are increasingly adopted for visual localization. In these approaches, the scene is implicitly encoded within a neural network that regresses a 3D world coordinate for each image pixel. Because the scene is represented only through the network parameters and not stored explicitly as images or maps, such methods are often assumed to be privacy-preserving. In this work, we show that this assumption is incorrect in practice. Specifically, we introduce a query-based attack that reconstructs the 3D geometry of the training environment from an SCR model under different levels of model access. To do so, we repeatedly query the model with batches of proxy images unrelated to the target scene to obtain dense pixel-wise 3D coordinates. Reliable points are identified through their stability under small input perturbations and can be further refined in a white-box setting. These stable points are accumulated across independent query batches to recover the scene geometry. From the recovered 3D representation, we also invert the network features to synthesize images from arbitrary viewpoints, revealing additional appearance information. Experiments on indoor and outdoor datasets demonstrate that substantial portions of training environments can be reconstructed with high geometric fidelity. Beyond geometry, we also recover an approximate color appearance, which exposes recognizable layout and potentially sensitive scene elements. This directly contradicts claims in the literature that SCR representations are privacy-preserving by design, and reveals a real risk when such systems are deployed in private or security-critical spaces. The project page is available at this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2606.31164 [cs.CV] (or arXiv:2606.31164v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2606.31164 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[CV-109] Reasoning -aware Speculative Decoding for Efficient Vision-Language-Action Models in Autonomous Driving
链接: https://arxiv.org/abs/2606.31160
作者: Anh Dung Dinh,Simon Khan,Flora Salim
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:Modern Vision-Language-Action (VLA) planners for autonomous driving emit a chain-of-causation (CoC) reasoning step \emphbefore producing a trajectory. The reasoning is autoregressive and dominates inference latency, while the trajectory head is parallel and cheap. Latency is an operational constraint in autonomous driving, so accelerating the reasoning step is the central problem we address. We observe that CoC reasoning has two qualitatively different needs: most tokens continue routine setup that follows naturally from the ego-trajectory history, and a small fraction encode commitments that require fresh visual evidence about an unexpected situation. We split this reasoning into two specialized paths: a \emphroutine reasoner that handles the predictable continuation by attending to trajectory history, and a \emphdeliberative reasoner (the unmodified VLA target) that handles novel cases by attending to current visual evidence, using the speculative decoding framework as the architectural template for how the two paths cooperate. Unlike standard speculative decoding, our routine reasoner is not a smaller replica of the target; the two reasoners are deliberately specialized to read different parts of the prompt. We propose two techniques to realize this. First, we introduce \textbfFlatRoPE, a 1D rotary positional embedding in the draft that breaks the rotational symmetry of the target’s 3D M-RoPE, redirecting attention away from visual tokens and onto trajectory-history tokens. Second, we introduce \textbfAction-aware RL (AARL), a post-training stage that uses an action-quality reward together with a static-reference KL anchor. Together, our two-reasoner system reduces the reasoning-step running time by approximately 4\times relative to the original Alpamayo planner.
[CV-110] Rethinking Foundation Model Collaboration: Enhancing Specialized Models through Proxy Task Reasoning
链接: https://arxiv.org/abs/2606.31157
作者: Hongyi Lin,Yang Liu,Jinhua Zhao,Xiaobo Qu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models are increasingly integrated into embodied intelligence systems, but directly assigning them structured prediction tasks requires precise geometric and numerical estimation, where specialized models often remain stronger. This capability mismatch raises a key question: should foundation models replace task-specific predictors, or should they collaborate through tasks better aligned with their strengths? We propose FAT, a foundation-model-augmented task-specific reasoning framework that treats collaboration as task decomposition rather than model replacement. FAT decomposes structured prediction into specialist prediction, information-space reconstruction, and foundation-model proxy reasoning. The specialist generates geometrically and physically valid hypotheses in the native output space, while the foundation model performs a bounded proxy task, such as selection or verification, over reconstructed multimodal candidates. We instantiate this principle as ProxySelect with a vision–language model. Across 2D object detection, 3D object detection, trajectory prediction, and semantic segmentation, ProxySelect consistently improves specialized baselines and substantially outperforms direct foundation-model regression at lower computational cost. These results suggest a general collaboration principle: specialized models preserve task-specific structure, while foundation models refine their hypotheses through contextual proxy reasoning.
[CV-111] WaterGen: Decoupling Scene and Medium in Underwater Image Generation
链接: https://arxiv.org/abs/2606.31147
作者: Jiayi Wu,Tianfu Wang,Tianyi Xiong,Dehao Yuan,Xiaomin Lin,Md Jahidul Islam,Cornelia Fermuller,Christopher Metzler,Yiannis Aloimonos
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater computer vision tasks, such as detection, restoration, and segmentation, are limited by the scarcity of large-scale and diverse training data. We introduce WaterGen, a method for generating large-scale, realistic, and diverse underwater images that provides independent control of the scene and water medium conditions. Our approach treats underwater image generation as the decoupled control of two factors: realistic and diverse scene content (what is in the image), and accurate and controllable water medium effects (what the water does to the image). Existing methods generally achieve only part of this objective: they either provide controllability with limited realism or diversity, or generate realistic scenes without accurately and independently modeling water-medium effects. Our key insight, that allows us to avoid this compromise, is that scene generation and medium modeling can be decoupled within a latent diffusion framework, enabling diverse scene generation together with accurate and controllable underwater appearance. To do this, we decompose underwater image synthesis into two stages. First, we fine-tune the latent diffusion U-Net using degradation-free underwater images so that it learns to generate diverse and realistic latent embeddings of underwater scene content without medium-induced degradation. Second, we formulate the physically accurate medium degradation synthesis as a conditional decoding process applied to these latent embeddings. This decoupled design allows our model to generate diverse scenes with full control of underwater appearance. We leverage WaterGen to build large-scale synthetic underwater datasets that are diverse in scene structures and accurate in water effects and pseudo-labels. We demonstrate that our synthetic data consistently improve downstream performance in underwater restoration and semantic segmentation.
[CV-112] FROST: Training-Free Few-Shot Segmentation with Frozen Features and Nonparametric Statistics
链接: https://arxiv.org/abs/2606.31136
作者: Junghwan Park
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages
Abstract:Few-shot segmentation asks a model to delineate a target class in a query image from only a handful of annotated examples, a setting most acute in remote sensing, where labels are scarce and the imagery departs sharply from the natural images on which vision backbones are pretrained. Prevailing approaches either train a segmenter on labelled episodes, which raises accuracy within the training distribution but binds the model to it, or reduce each class to a lossy summary of frozen features, a single prototype, a few cluster prototypes, or a discrete clustering, none of which preserves the internal structure of a multimodal class. We argue that a class is better described by a distribution than by a point, and that frozen self-supervised features already carry enough structure to estimate that distribution directly. We introduce FROST, a training-free few-shot segmenter that treats the reference foreground and background as two point clouds on the unit sphere of frozen DINOv3 features and labels each query token by a nonparametric density ratio, with a threshold the Bayes rule fixes at zero under equal priors. Because the variance of a density estimate shrinks as its sample grows, the decision sharpens as references accumulate, and every remaining quantity from the kernel bandwidth to the spatial gate is read from the support set rather than tuned. We develop FROST for overhead imagery, where a class is typically a scatter of many small and dissimilar instances that a density tracks but a lossy summary blurs. Across seventeen remote-sensing benchmarks FROST surpasses both training-free and learning-based methods, leading by 5.6 mIoU from a single annotated example and widening its lead as the support set grows, all while remaining among the smallest models compared. Code is available at this https URL.
[CV-113] MSNN-LINet: Cross-Modal Learning via Continuous Linear Integration
链接: https://arxiv.org/abs/2606.31135
作者: Gabriel Clinger
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 14 pages, 6 figures, 3 tables
Abstract:We present LINet (Linear Integration Network), a Multi-Stream Neural Network (MSNN) for RGB-D scene classification. Current multi-modal architectures treat feature fusion as a discrete, ad-hoc event: early fusion entangles representations prematurely, late fusion isolates them until the final layer, and hybrid or attention-based methods require architectural guesswork to place intermediate fusion blocks. LINet addresses this structural compromise by maintaining three dedicated parallel streams (RGB, depth, and integration) where a novel Linear Integration Convolution (LIConv2d) operator enables continuous cross-modal learning at every layer. The integration stream receives raw filtered signals from both modality streams and combines them before the nonlinear activation threshold, conceptually inspired by somatic integration preceding the neuronal firing decision. Implementing continuous integration exposes a critical initialization pathology: Kaiming initialization of the bridging weights scrambles gradients before they reach the stream backbones, producing a failure mode that resembles overfitting but is corrupted gradient flow. A 1/N constant initialization mitigates this. We employ progressive modality dropout, a curriculum adapted to continuous fusion in which blanking probability increases from zero, preventing pathway collapse, a form of negative co-learning, by forcing robust independent stream representations. Trained from scratch on SUN RGB-D 19-class scene classification, LINet reaches 45.2% mean class accuracy at ResNet18 scale, outperforming prior from-scratch results, and rises to 49.6% with in-domain RGB-D (ScanNet) pretraining.
[CV-114] SkillSpotter: Pose-Aware Multi-View Skilled Action Detection and Grading in Ego-Exo Videos ECCV
链接: https://arxiv.org/abs/2606.31127
作者: Björn Braun,Christian Holz
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for publication at European Conference on Computer Vision (ECCV)
Abstract:To enable personalized, real-time coaching using Augmented Reality glasses or fixed camera setups in domains such as sports, cooking, or music, a system must understand not just what a person does, but how well they execute an activity. In an ego-exo video setting, this requires simultaneously detecting individual skilled actions and classifying each as correct or needing improvement, which Ego-Exo4D’s proficiency demonstration benchmark formalized. We first adapt seven state-of-the-art temporal action detection architectures to this task, extend the evaluation protocol to disentangle detection from grading, and show that existing methods grade near-randomly. We then introduce SkillSpotter, a pose-aware multi-view architecture that jointly detects and grades skilled actions through three task-specific modules: (1) adaptive temporal suppression to handle the varying density of skilled actions across diverse activities, (2) gated 3D body pose fusion to leverage body kinematics as a complementary signal to visual features, and (3) bidirectional cross-view attention to combine ego and exo views effectively. SkillSpotter improves class-specific mAP from 12.40 to 21.82 (+76%) and balanced accuracy from 55.99% to 60.40% over the best baseline. SkillSpotter’s modules transfer to other temporal action detection models with consistent gains, and our method generalizes beyond Ego-Exo4D to HoloAssist. Code: this https URL
[CV-115] WildProp: Visual Estimation of Wildlife Body Proportions at Scale ECCV26
链接: https://arxiv.org/abs/2606.31125
作者: Mustafa Chasmai,Aaron Sun,Subhransu Maji
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 26
Abstract:Population-level morphometric measurements underpin ecological and evolutionary studies but traditionally require controlled imaging or physical specimen handling, limiting scalability. We present WildProp, a training-free framework that estimates wildlife body proportion distributions directly from large-scale, unconstrained image repositories. We cast morphometric estimation as a retrieval-driven correspondence problem: given a single user-annotated canonical image, WildProp performs pose-aware retrieval using foundation model features, transfers part endpoints via dense patch-level matching, filters predictions using geometric consistency, and aggregates measurements across retrieved images to estimate population-level ratio distributions. Unlike supervised keypoint pipelines, our approach adapts to arbitrary species and user-defined parts without per-species training. Evaluations on three large morphometric datasets spanning birds and amphibians show median relative errors of 10-20%. We further highlight the broad applicability of our approach through a number of case studies measuring various proportions across diverse taxa, including birds, frogs, insects, and flowers. Ablations demonstrate that pose-aware retrieval is critical for stable estimation, while robust aggregation mitigates keypoint and pose noise. Our results indicate that carefully curated 2D correspondences over web-scale imagery can provide scalable morphometric proxies for comparative and subgroup analyses across taxa, geography, and seasonality.
[CV-116] JacobianAvatar: Temporally Consistent Semi-rigid Avatar Reconstruction from a Monocular Video
链接: https://arxiv.org/abs/2606.31115
作者: Changyeon Won,Min-Gyu Park,Seonghwan Park,Ju Hong Yoon,Hae-Gon Jeon
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating realistic human avatars in complex motions–such as clothing dynamics–requires modeling of global and local deformations which remains challenging in monocular settings. We address this problem by leveraging neural Jacobian fields (NJFs) for representing semi-rigid deformations. We train self-supervised neural networks for predicting Jacobian matrices that give the pose-dependent deformations, by solving a Poisson equation. However, monocular input presents several difficulties such as self-occluded regions and invisible surfaces. To address these issues, we introduce three key components: a constrained Poisson solver, signed distance-based Jacobian regularization, and a deformation-guided residual flow loss, which together suppress boundary artifacts, recover frequently occluded regions such as armpits and thighs, and enforce temporal consistency during motion. Experiments on benchmark and in-the-wild videos demonstrate that our method generates temporally stable and geometrically coherent avatars, outperforming state-of-the-art approaches.
[CV-117] InfiniVerse: Occupancy Guided Unbounded Scene Generation for Autonomous Driving
链接: https://arxiv.org/abs/2606.31109
作者: Xiaoyu Ye,Leheng Li,Xinyu Ji,Yingjie Cai,Hongda He,Xu Yan,Guanyi Zhao,Ying-Cong Chen,Bingbing Liu,Shuguang Cui,Zhen Li
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generating realistic, controllable, and temporally coherent urban environments is a critical yet unresolved challenge in the autonomous driving community. In this paper, we introduce InfiniVerse, a unified pipeline for long-range, 2D-3D-aligned, and controllable synthesis of dynamic urban scenes from a single frame. In practice, our approach first reconstructs a 3D occupancy representation from the input multi-view frame. This representation serves as a foundation for autoregressive scene extension along arbitrary trajectories. Subsequently, a video diffusion model translates the coarse occupancy grid into realistic, spatiotemporally consistent video sequences. Moreover, we propose a hierarchical sketch-and-refine paradigm, in which the generated videos are re-projected as image-conditioned feedback to enhance the 3D occupancy representation, establishing cross-modal alignment and mutual enhancement between the visual and spatial domains. Extensive evaluations on the Waymo Open Dataset and nuScenes demonstrate that InfiniVerse achieves state-of-the-art performance, with a FID of 6.4 and FVD of 67.97, significantly outperforming existing benchmarks in both duration and stability.
[CV-118] axoMIL: Taxonomy-Constrained Learning for Hierarchical Whole Slide Image Analysis ECCV2026
链接: https://arxiv.org/abs/2606.31100
作者: Chaeyeon Lee,Khang Nguyen Quoc,Jinsol Song,Yosep Chong,Kwangil Yim,Jin Tae Kwak
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ECCV 2026
Abstract:Whole slide image (WSI) analysis is central to computational pathology, with multiple instance learning (MIL) emerging as the standard pipeline for slide-level diagnosis. However, conventional approaches formulate WSI diagnosis as a flat classification task over discrete labels, contradicting the inherently hierarchical, coarse-to-fine nature of clinical reasoning. Although recent hierarchical classifiers and vision-language models (VLMs) have sought to address this structural gap, they either fail to capture semantic continuity between related diagnoses or suffer from unconstrained text generation that produces taxonomic hallucinations and parent-child label violations. To address these limitations, we propose TaxoMIL, a taxonomy-constrained framework that reformulates WSI diagnosis as a multi-granularity text generation task. TaxoMIL utilizes a dual-head Transformer decoder to generate coarse- and fine-level diagnostic text, and introduces taxonomy-guided objectives that explicitly structure the label embedding space and strictly ground slide-level visual representations within the clinical taxonomy. Extensive experiments across three diverse WSI datasets demonstrate that TaxoMIL consistently outperforms state-of-the-art MIL classifiers and VLM-based generative methods, yielding accurate and hierarchy-aware diagnostic predictions. The code is released at this https URL
[CV-119] Seeing Through Multiple Views: Parameter-Efficient Fine-Tuning via Selective Neurons for Consistent Radiology Report Generation MICCAI2026
链接: https://arxiv.org/abs/2606.31099
作者: Yucheng Chen,Jinjing Zhu,Yang Yu,Yufei Shi,Hane Naghshbandi,Jinhua Liu,Angela S. Koh,Fang Fen,Kian Eng Ong,Si Yong Yeo
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by MICCAI2026
Abstract:Recent years have seen substantial advances in radiology report generation (RRG), yet existing approaches predominantly adopt direct feature fusion when handling multi-view X-ray images. Such approaches overlook the potential clinical inconsistencies and inaccuracies arising when a single model processes different views, adversely impacting performance and clinical reliability. To this end, we introduce View-PNDF (View-specific Pattern Neuron Detection and Fine-tuning), a parameter-efficient framework that fosters view-consistent report generation from a neuronal perspective. Specifically, View-PNDF comprises: (i) a view-specific neuron detection module identifying neurons responsive to particular views, (ii) a verification module quantifying the existence of these neurons, and (iii) a selective fine-tuning strategy strengthening detected neurons while preserving view-agnostic representations. By updating only view-specific neurons, View-PNDF achieves consistent diagnoses across different views with reduced computational costs. Subsequently, we employ Large Language Models (LLMs) to consolidate the view-specific reports into a complete radiology report. Furthermore, we use traditional Natural Language Generation (NLG) metrics-based assessment on integrated reports for baseline comparison and employ LLM-based assessment (e.g., GPT-4o) on view-specific reports to capture clinical significance. Extensive experiments on two medical RRG benchmarks demonstrate that View-PNDF substantially improves view-specific chest X-ray report generation quality while maintaining robust general-view performance.
[CV-120] PiLoT v2: Pixel-to-Orthogonal Map Alignment for Free-view UAV Geo-localization
链接: https://arxiv.org/abs/2606.31098
作者: Xinyi Liu,Xiaoya Cheng,Rouwan Wu,Zhaochen Wang,Shen Yan,Maojun Zhang,Yu Liu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-time, drift-free UAV geo-localization is essential for autonomous missions in GNSS-denied environments. The pioneering system, PiLoT, achieves high precision via Neural Pixel-to-3D Registration, aligning UAV video streams with a single rendered reference view from 3D meshes. However, its reliance on heavy 3D meshes incurs massive storage overheads, complex map acquisition, and significant computational rendering costs, severely hindering deployment on embedded platforms. To address these bottlenecks, we propose PiLoT v2, a lightweight yet robust evolution that shifts the paradigm to direct pixel-to-orthogonal map registration for free-view UAV geo-localization. By leveraging True Digital Orthophoto Maps (TDOMs) and Digital Surface Models (DSMs) as the reference substrate, PiLoT v2 replaces GPU-intensive 3D rendering with a highly efficient, CPU-friendly map cropping operation. To bridge the severe geometric discrepancy between these 2.5D orthogonal crops and free-view oblique UAV imagery, we train a cross-view feature registration network using a novel, large-scale geometrically annotated dataset. Furthermore, we integrate onboard sensor prior–specifically gravity direction and single-point laser rang–directly into the pose optimization manifold to enhance robustness against cross-view visual degradation. Experimental results demonstrate that PiLoT v2 achieves performance comparable to, or even exceeding, its Pixel-to-3D predecessor, while offering drastically lower storage and computational costs.
[CV-121] Horizon3D: Sparse Radar-Camera Fusion for Long-Range 3D Perception in Autonomous Driving ECCV2026
链接: https://arxiv.org/abs/2606.31096
作者: Geonho Bang,Geunju Baek,Dongyoung Lee,Wonjun Jeong,Jun Won Choi
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. Project page: this https URL . Code: this https URL
Abstract:Long-range 3D object detection is critical for safe autonomous driving at highway speeds, yet existing radar-camera fusion methods remain limited at extended ranges. BEV-based methods capture scene-level context but incur rapidly growing computation and often lose fine-grained object detail, while query-based methods are efficient but provide limited scene-level context. Temporal fusion further requires both multi-frame accumulation for sparse distant observations and object-level motion modeling for fast-moving objects. We propose Horizon3D, a sparse radar-camera fusion framework for long-range 3D object detection that combines Gaussian primitives with sparse BEV features. Horizon3D initializes Gaussian primitives at radar- and camera-estimated object keypoints using Keypoint-Guided Gaussian Initialization, refines them through Object-Centric Sparse Fusion, and splats them onto the BEV plane to fuse object-level detail with sparse radar BEV context. It further introduces Dual-Path Temporal Fusion, which aggregates temporal cues through a BEV path for scene-level accumulation and a Gaussian path for object-level motion propagation. Experiments on TruckScenes show that Horizon3D achieves state-of-the-art radar-camera 3D detection performance. On the validation set, it outperforms the previous best method by +3.0 NDS and +1.6 mAP while maintaining competitive inference speed.
[CV-122] Do Not Break the Vessels: Structure-Preserving Mean Flow for Vascular Image Translation
链接: https://arxiv.org/abs/2606.31095
作者: Changjin Sun,Zhuo Hu,Kaini Wang,Baixuan Wu,Shuo Gao,Runan Zheng,Cheng Xue,Yudong Zhang,Guangquan Zhou
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing anatomically faithful vascular structures from clinically accessible imaging modalities is of substantial clinical significance. However, existing cross-modal translation methods mainly emphasize pixel-level fidelity or visual realism and treat structure preservation as a property of the final output rather than an invariant of the generative process. This limitation often leads to structural discontinuities and artifacts, compromising anatomical coherence and clinical reliability. In this work, we propose a Structure-Preserving Mean Flow (SPMF) framework that formulates vascular image translation as a topology-invariant transport process. Based on a structural invariance principle, we derive an orthogonality constraint on the flow velocity field that formally separates appearance transport from topological distortion. We implement this constraint as a time-weighted surrogate objective within a Brownian bridge diffusion model to preserve topology at every diffusion step. Moreover, we propose a Prototype-Guided Structural Refinement (PGSR) module to align degraded inference-time structures with reliable training-time structures. Experiments on paired NIRII-to-2PF and fundus datasets demonstrate consistent improvements over state-of-the-art methods, achieving peak PSNR values of 24.96 dB and 24.83 dB, respectively.
[CV-123] Anchoring on Reality: Breaking the Pseudo-Target Ceiling in Makeup Transfer ECCV2026
链接: https://arxiv.org/abs/2606.31089
作者: Bo Wei,Xianhui Lin,Yi Dong,Zhongzhong Li,Zonghui Li,Zirui Wang,Jiachen Yang,Xing Liu,Hong Gu,Xiaoming Li,Wangmeng Zuo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ECCV 2026
Abstract:Makeup transfer applies a reference cosmetic style to a source face while preserving its identity and geometry. However, this task is severely hindered by the lack of real paired training data. Current methods rely on either weak priors or synthetic pseudo-targets from large-scale editing models. These paradigms provide suboptimal guidance, often leading to degraded fine-grained details, synthetic artifacts, and identity drift. To this end, we propose Anchoring on Reality Makeup Transfer (ART), a two-stage framework with a reality-anchored refinement cycle. In Stage I, the model is initialized with pseudo-targets to establish basic semantic alignment and global makeup placement. Crucially, Stage II shifts supervision from pseudo-targets to the real reference, reconstructing it from its bare-skin counterpart through a differentiable cycle that penalizes any omitted detail and overrides synthetic artifacts. Furthermore, we introduce MakeupFaces2K (MF2K), the first 2K-resolution in-the-wild makeup portrait dataset comprising 8,573 images. Extensive experiments demonstrate that our method achieves superior makeup fidelity, strong background stability, and robust identity preservation, especially for complex makeup styles.
[CV-124] owards Flexible Natural Efficient Interaction for Conversational Talking Face Generation
链接: https://arxiv.org/abs/2606.31088
作者: Baiqin Wang,Sen Chen,Jiankuo Zhao,Xiangyu Liu,Zhen Lei,Xiangyu Zhu
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 Pages,8 figures
Abstract:Conversational talking face generation has recently attracted increasing attention, aiming to synthesize interactive talking videos where characters speak, listen, and respond dynamically to each other. This task presents three core challenges: 1) Flexibility: enabling multi-round dialogues with an arbitrary number of participants; 2) Naturalness: maintaining coherent motion and appropriate non-verbal feedback throughout the interaction; and 3) Efficiency: achieving real-time generation and low computation overhead for long-term continuous online conversation. Despite recent advances, existing methods still fall short in balancing all three requirements. To bridge this gap, we introduce InterTalk, a novel and efficient framework designed for highly interactive conversational talking face generation. Built upon a motion-based architecture, InterTalk supports real-time conversation synthesis. Our method achieves strong flexibility by explicitly modeling multi-round conversational dynamics among each participant, eliminating constraints on their numbers. To enhance interactivity, we incorporate motion feedback from multiple participants and introduce an iterative generation strategy for more natural behaviors. Besides, we disentangle motion into several facial components, enabling targeted refinements for natural response such as precise lip sync and realistic eye blinking. Finally, we construct a new multi-person conversational dataset and enrich it with 3D face-based data augmentation. Extensive experiments demonstrate that InterTalk achieves superior interaction quality while maintaining real-time performance at 30 FPS.
[CV-125] CasaMaestro: Multi-View Panoramas for House-Scale 3D Reconstruction ECCV2026
链接: https://arxiv.org/abs/2606.31086
作者: Yuzhou Ji,Xiaotian Yang,Zhipeng Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV2026
Abstract:The rise of home-deployed embodied AI systems is driving a growing need for fast, metric 3D reconstruction of residential spaces to support navigation, interaction, and long-horizon task execution. However, the commonly used pinhole-camera 3D reconstruction pipelines struggle to model large indoor residences efficiently due to their limited field of view, to which achieving full coverage across multiple rooms often requires thousands of images and incurs drift from long chains of incremental alignment. In this work, we present CasaMaestro (Spanish words meaning house'' and master’'), a feedforward model that can take only twenty to fifty sparse multi-view indoor panoramas as input and directly predicts metric depth along with camera poses, allowing fast point-cloud reconstruction of the entire house with full coverage. CasaMaestro is the first model that supports house-scale reconstruction with multi-view panoramas. Experiments show that CasaMaestro can robustly provide high quality results in both real-world and synthetic scenes, which can serve as a strong foundation for acquiring house-scale 3D indoor assets to be applied in close-loop simulation.
[CV-126] Fleet: Few Shots Lead Effective AI-generated Image Detection ICML2026
链接: https://arxiv.org/abs/2606.31082
作者: Jiaan Wang,Sirui Liu,Yu Li,Kaiyuan Yang,Juan Cao,Sheng Tang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 25 pages, accepted by ICML 2026
Abstract:AI-generated image (AIGI) detection is undergoing a critical transition from laboratory benchmarks to open-world adversarial defense. The prevalent paradigm focuses on finding static feature spaces, assuming that some invariant artifacts learned from historical data can achieve universal zero-shot generalization. While achieving saturation on several AIGI benchmarks, this static hypothesis suffers a severe performance drop against rapidly evolving generators (e.g., SD3, Nano Banana Pro). To address these limitations, we propose that the field should expand beyond “static generalization” to a new paradigm of “dynamic adaptation”. We introduce Fleet, a framework that pioneers a dynamic paradigm of continuous few-shot evolution, enabling rapid alignment with emerging generative threats. Fleet improves few-shot adaptation by replacing unconstrained feature updates with constrained routing correction, where avoidance routing redirects novel AI samples away from Non-AI-dominated routes within decoupled subspaces. To validate this, we present Treasure, a benchmark spanning 64 models and 360k images, featuring diverse architectures and 20 closed-source commercial engines. Experiments reveal that while static SOTA methods fail catastrophically on modern generators, Fleet restores performance from 20.4% to 73.1% with only 10-shot adaptation on “Doubao Seedream 4.0”. Code and data are available at this https URL .
[CV-127] AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images
链接: https://arxiv.org/abs/2606.31077
作者: Meng Yang,Zizhuo Li,Linfeng Tang,Fan Fan,Jiayi Ma
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal image matching is essential for visual localization and multi-sensor fusion, but it is hindered by the scarcity of large-scale training data with precise geometric annotations. Existing real-world datasets suffer from prohibitive costs, limited scene diversity, and errors in SfM-MVS pipelines, while synthetic methods struggle to maintain 3D geometric consistency or achieve photorealistic appearance. To address this, we propose AnyMatch, a novel framework that leverages abundant, easily accessible single-view images at minimal cost to generate rich multi-modal training data. AnyMatch integrates monocular depth estimation, 3D reprojection, diffusion-based inpainting, and crossmodal image translation to synthesize multi-view, multi-modal image pairs with 3D geometric fidelity. Crucially, our method provides annotations that strictly adhere to 3D geometric consistency through explicit 3D reprojection, avoiding SfM-MVS error accumulation. Furthermore, AnyMatch offers strong scalability, enabling controllable scene diversity and annotation difficulty via adjustable input and camera parameters. We construct Any-syn, a large-scale synthetic multi-modal dataset using AnyMatch. Experimental results show that matching networks (e.g., LoFTR, EDM, RoMa) fine-tuned on Any-syn achieve substantial performance gains on multi-modal benchmarks, exhibiting superior generalization and robustness compared to models trained on existing data.
[CV-128] Hierarchical 3D Scene Graph Construction and Belief-based Planning for Semantic Navigation ECCV2026
链接: https://arxiv.org/abs/2606.31071
作者: Bing Wu,Zuyao Chen,Changwen Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Camera-ready version accepted at ECCV 2026
Abstract:Semantic navigation is a fundamental task for embodied agents operating in unseen environments, requiring both semantic understanding and long-term decision-making. Recent foundation models have empowered agents with rich semantic priors for this task. However, without structured global representations, decision-making often falls back on local observations and greedy strategies, resulting in inefficient exploration and myopic behaviors, especially in long-distance navigation. To address these challenges, we propose a zero-shot semantic navigation framework. Our method incrementally maintains an online Hierarchical 3D Scene Graph (HSG) to form a multi-granular semantic topology over objects, zones, and regions, serving as a compact state abstraction for global planning. Building on this memory, we introduce a hierarchical belief-based planning framework that fuses semantic priors with exploration evidence on the HSG, and performs finite-horizon rollouts on an HSG-based simulator to explicitly estimate the long-term expected returns of candidate macro-actions. This enables globally consistent decisions and reduces redundant backtracking. Extensive experiments in high-fidelity simulation environments across multiple tasks and datasets demonstrate that our method outperforms existing state-of-the-art methods, particularly in long-distance scenarios, where our approach improves SR and SPL by an average of 9.4% and 5.0%, respectively.
[CV-129] Hybrid Unet-Transformer Model for Generating Stress and Strain Fields from Composite Geometrics
链接: https://arxiv.org/abs/2606.31068
作者: Shrey Patel
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: International Conference on Emerging Digital Intelligence and Generative Engineering
Abstract:Accurate prediction of stress and strain fields in hierarchical composite microstructures is critical for physics-informed material design, yet conventional finite element method (FEM) simulations are computationally prohibitive at scale, requiring minutes to days per evaluation. In this work, we propose a hybrid UNet-Transformer architecture that predicts complex mechanical field distributions directly from composite microstructure geometry images, serving as an efficient surrogate for FEM across ten distinct stress and strain field types spanning diverse two-phase composite configurations including square, hexagonal, and triangular tessellations, multiple boundary conditions, and high-resolution geometries. Results demonstrate that the proposed architecture achieves strong predictive performance across the majority of subdatasets, with peak accuracy on periodic tessellation geometries reaching R2=0.9991, SSIM=0.9936, and MAE=0.0050 on the boundary condition subdataset and the triangular tessellation subdataset respectively. Across six of the eight evaluated subdatasets, MAE remains below 0.05 on the normalized [0,1] pixel scale. Encoder attention analysis via Grad-CAM and Grad-CAM++ confirms that the model develops physically meaningful internal representations, localizing attention at mechanically critical regions including phase boundaries, ligament junctions, and indenter contact zones without explicit structural supervision. Performance degrades on irregular square-grid geometries with sparse soft-phase inclusions, with the S11 normal stress subdataset yielding R2=0.7735 and SSIM=0.7126, consistent with the known limitation of smooth-loss image translation models in reproducing sharp stress discontinuities.
[CV-130] Diffusion-Based Material Regularization for Physics-Based Inverse Rendering ECCV2026
链接: https://arxiv.org/abs/2606.31065
作者: Jingwang Ling,Lifan Wu,Feng Xu,Shuang Zhao
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026. Includes supplementary material. Project page: this https URL
Abstract:Reconstructing physics-based 3D assets – geometry, materials, and illumination – from multi-view images is a core problem in computer graphics and vision, and a prerequisite for realistic relighting and editing. Physics-based inverse rendering offers an accurate image-formation model, but is severely underconstrained: without strong priors, illumination is baked into materials, and reconstructions generalize poorly to novel views and lighting. Data-driven diffusion models, in contrast, predict visually plausible materials, yet their predictions rarely satisfy the rendering equation and are not directly usable for physics-based rendering. We bridge these two paradigms rather than replacing either. Our key idea is to treat the predictions of a state-of-the-art diffusion model not as target material values but as a similarity kernel for optimization: we introduce a regularization loss that penalizes deviations in the optimized material over surface regions where the diffusion predictions are near-constant, while leaving the optimization free to match the input images. Built on this regularizer, our end-to-end pipeline jointly reconstructs geometry, materials, and illumination, yielding high-quality assets that drop into standard rendering pipelines and relight faithfully. On the Synthetic4Relight, Stanford-ORB, and DTC-Synthetic datasets, our method significantly outperforms state-of-the-art baselines in both reconstruction accuracy and relighting quality.
[CV-131] Online TT-ALS for Streaming Tensor Decomposition with Incremental Orthogonalization
链接: https://arxiv.org/abs/2606.31061
作者: Hiroki Takeda,Yuto Miyatake,Daisuke Furihata
类目: Numerical Analysis (math.NA); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 19 pages, 7 figures. The Julia source code is available at this https URL
Abstract:Tensor Train (TT) decomposition is a powerful technique for analyzing high-dimensional data. Existing algorithms for computing TT decompositions can be categorized into two main types: conventional batch-based approaches and recursive online methods. In the context of streaming data, batch methods typically achieve higher reconstruction accuracy but often suffer from memory exhaustion, while online methods provide greater computational efficiency. In this work, we introduce Online TT-ALS (Alternating Least Squares), an algorithm that sequentially enforces orthogonality constraints. This approach allows for efficient and exact updates of the core tensor while maintaining high reconstruction accuracy. Theoretically, we prove that enforcing these orthogonal gauge constraints guarantees monotonic decrease of the local objective function and temporal smoothness. Computationally, our deterministic single-sweep update reduces the rank dependence from quadratic to linear, achieving an overall complexity of \mathcalO(I^n-1 r) . Experimental results demonstrate that the proposed method outperforms existing online techniques not only in terms of mathematical approximation accuracy but also in human perception-based video quality metrics. Furthermore, compared to recent deep learning-based paradigms, our algebraic approach achieves speedups of several orders of magnitude. Consequently, our method exhibits high computational efficiency and is suitable for low-latency real-time processing applications.
[CV-132] Learning Video Dynamics with Predictive Differentiable Rendering ECCV2026
链接: https://arxiv.org/abs/2606.31050
作者: Yujin Tang,Tian Zhou,Xin Lin,Cheng Tan,Yifan Hu,Rong Jin,SouYoung Jin,Liang Sun
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ECCV 2026. 18 pages, 5 figures, 11 tables
Abstract:How to accurately predict a high-fidelity future world? While the visual world is inherently continuous, existing deterministic video prediction models operate in discrete pixel space and are mainly optimized with pixel-wise mean squared error (MSE), which often leads to over-smoothed predictions and a lack of fine-grained visual details. To address these limitations, we propose Predictive Differentiable Rendering (PDR), a novel end-to-end video prediction paradigm that bridges the gap between discrete and continuous representations. Inspired by recent progress in 3D reconstruction with 3D Gaussian Splatting, we introduce PredGS, a lightweight and plug-and-play adapter based on 2D Gaussian representation, which could be seamlessly integrated with existing pixel space predictors, significantly improving spatial detail preservation with negligible computational overhead. Furthermore, we develop predgsplat, a CUDA-accelerated differentiable 2D Gaussian renderer supporting arbitrary channels. Each Gaussian is defined by 5 + C learnable parameters (position, scale, rotation, and C channel amplitudes) and achieves up to 10x faster rendering than the baseline. Optimized by a combined L1 and SSIM loss, PDR overcomes the inherent blurring tendencies of MSE Loss, significantly enhancing the prediction performance. Extensive experiments on diverse real-world benchmarks, including TaxiBJ, WeatherBench, KTH, and Human3.6M, demonstrate that PDR consistently surpasses existing methods, delivering superior detail preservation, visual fidelity, and predictive accuracy.
[CV-133] rraDiT-Ω: Unified Spatial Control for Satellite Image Synthesis with Any Geospatial Primitive
链接: https://arxiv.org/abs/2606.31029
作者: Brian Wei,Srikumar Sastry,Daniel Cher,Eric Xing,Nathan Jacobs
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: European Conference on Computer Vision 2026
Abstract:Generative models have achieved remarkable progress, yet applying them to satellite imagery remains challenging. Unlike natural imagery, satellite scenes are structured by spatially complex and semantically distinct geometries. Prior work addresses this complexity by adapting natural image frameworks using dense rasters or sparse prompts, trading off annotation cost and fidelity while breaking compatibility with vector primitives commonly used to represent geographic information. We introduce TerraDiT- \Omega , a unified spatial control framework that generates satellite imagery directly from any native geospatial primitive. By jointly leveraging precise annotations (polygons, polylines) and coarser ones (bounding boxes, points), the model supports controllable layouts across varying annotation budgets, broadening applicability to design tasks such as urban planning while remaining naturally compatible with end-to-end GeoAI workflows. To effectively leverage these primitives during generation, we propose Geometry-Aware Local Attention, a conditioning mechanism that injects explicit geometric cues into the attention space. Across all conditioning formats, our approach consistently outperforms both dense-control and sparse-control baselines. Furthermore, this flexibility enables controllable synthetic data augmentation using a single generative model, improving downstream performance on land-cover segmentation, object detection, road graph extraction, and scene classification. Code, data, and weights are available at this https URL.
[CV-134] WarpI2I: Image Warping for Image-to-Image Translation ECCV2026
链接: https://arxiv.org/abs/2606.31018
作者: Shen Zheng,Anurag Ghosh,Gaurav Parmar,Srinivasa Narasimhan
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ECCV 2026
Abstract:Image-to-image (I2I) translation has achieved strong results in tasks like human relighting and driving scene translation using latent diffusion models (LDMs). However, compact LDMs often struggle to preserve fine-grained structures because the encoder compresses high-resolution inputs into a spatially downsampled latent space. To address this issue, we propose a simple saliency-guided warp-unwarp framework that reallocates spatial representation toward salient regions before encoding, enabling better preservation of structural details without increasing latent resolution. The warped image is processed by the original diffusion model and then mapped back via an inverse warp. In addition, we propose a simple and efficient outpainting-based synthetic data generation pipeline to produce high-quality paired data for image relighting. Our method is model-agnostic, requires no architectural modification, and introduces negligible computational overhead. Experiments on human relighting, driving scene relighting, and translation demonstrate improved structural preservation, lighting faithfulness, and image quality, with our framework extending naturally to video via frame-by-frame application with good temporal stability. Project Webpage: this https URL
[CV-135] Dual Sparse Aggregation Transformer for Multispectral Object Detection
链接: https://arxiv.org/abs/2606.31015
作者: Wencong Wu,Xiuwei Zhang,Hanlin Yin,Hongxi Zhang,Yanning Zhang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Transformer-based approaches have obtained excellent performance in multispectral object detection tasks due to their ability to model long-range dependencies and capture complementary information. However, previous transformer-based multispectral detection methods tend to use all available tokens for similarity calculation, which results in redundant information interaction from irrelevant areas, leading to degraded detection performance. To overcome this challenge, we propose a novel Dual Sparse Aggregation Transformer (DSAFormer) for multispectral object detection, which consists of a Dual Sparse Transformer (DSFormer) and a Learnable Addition Fusion Block (LAFB). Specifically, the DSFormer is designed to exploit and boost cross-modal complementary information, thereby improving detection performance. It incorporates three key components: A Spatial Sparse Multi-Head Cross-Attention (SSMHCA) mechanism selectively captures cross-modal relationships at the spatial level by reserving only the high query-key similarity scores, eliminating irrelevant interactions. A Channel Sparse Multi-Head Cross-Attention (CSMHCA) mechanism performs similar sparse calculations at the channel level to enhance feature representation and filter out low matching query-key. A Multi-Scale Feature Refinement Layer (MSFRL) is developed to aggregate hierarchical features and suppress redundant information. To effectively fuse multimodal features, the LAFB is introduced to aggregate intramodal and intermodal feature information by feature reweighting. Extensive experimental results have demonstrated that our proposed DSAFormer achieves better detection performance against state-of-the-art methods on four public datasets, including the MFAD, FLIR, M ^3 FD, and LLVIP. The source code of our DSAFormer will be released at this https URL.
[CV-136] Dense Structural Priors for Sparse Functional Landmark Localization in Surgical Videos
链接: https://arxiv.org/abs/2606.31007
作者: Chenyan Jing,Hao Ding,Lalithkumar Seenivasan,Jacob M. Delgado López,Mathias Unberath
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision foundation models such as SAM 3 can provide transferable object-level structure across diverse surgical video conditions, but segmentation outputs do not explicitly encode the action-conditioned semantics that define functional surgical landmarks. Estimating instrument extent and geometry differs from localizing the tip or anchor relevant to clipping, grasping, or dissecting. We investigate vision foundation model-enabled sparse action-aware landmark localization, using zero-shot, point-prompted structural masks to provide dense instrument-level context without manual pixel-level mask annotations. We propose a lightweight refinement framework that uses SAM 3 as a structural prior. A coarse multi-frame network predicts tip and anchor prompts, generating non-oracle masks that are fused with visual and heatmap features to refine functional landmark predictions. We compare direct mask-augmented supervision, prediction-derived mask-prior refinement, and auxiliary mask supervision to examine how vision foundation model-derived structure should enter a precision-oriented localization system. Experiments on 7,867 clips from 60 surgical videos spanning YouTube, Cholec80, HeiChole, SurgVU, and CRCD evaluate the approach under heterogeneous conditions. Without manual pixel-level mask annotations for training, the proposed model achieves overall F1 scores of 72.4% for tip and 58.0% for anchor localization. Directly imposing masks on heatmap targets biases learning toward broad tool regions, whereas prediction-derived priors and auxiliary supervision provide effective intermediate structural guidance for action-dependent landmark prediction.
[CV-137] Auditing Generalization in AI-Generated Video Detection: A Six-Control Protocol and the VidAudit Toolkit
链接: https://arxiv.org/abs/2606.31004
作者: Mert Onur Cakiroglu,Zhihe Lu,Mehmet Dalkilic,Hasan Kurban
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:AI-generated video detection benchmarks such as GenVidBench and AIGVDBench are the de facto leaderboards, yet most evaluation protocols leave uncontrolled confounds that can inflate reported generalization. As an existence proof, a three-feature clip-length classifier reaches a leave-one-generator-out (LOGO) AUC of 0.998 on GenVidBench under unaudited evaluation, while measuring nothing about motion. A 20-paper survey finds none applying all six standard controls that would catch this, so we combine them into an audited protocol and apply it to six representative feature sources (three published detectors and three repurposed signal sources), re-running it cross-dataset on AIGVDBench. The audit both debunks and certifies: the trivial classifier collapses to near chance (0.529), a CLIP baseline is caught carrying dataset identity, and the 2025 forensic detector WaveRep clears the floor at out-of-distribution LOGO AUC 0.996 with chance-level real-vs-real coherence. At a deployable FPR of 0.1%, multiple high-AUC methods fall to single-digit recall and the leaderboard order changes, so we recommend an audited tuple (AUC, above-floor margin, operating-point recall, and calibration) over a single number. As a white-box positive control, we add TemporalSpec (codec motion vectors); via cross-substrate feature fusion (XSFF), a second substrate adds genuine complementarity that survives the audit. We release VidAudit, to our knowledge the largest unified and audited detector collection for this task, providing 14 detectors behind one plugin API, a leaderboard, and Croissant metadata, available at this https URL. Together, the protocol and toolkit move evaluation from leaderboard rank toward whether a result measures what it claims.
[CV-138] PhotoQuilt: Training-Free Arbitrary-Resolution Photomosaics via Bootstrapped Tiled Denoising
链接: https://arxiv.org/abs/2606.30968
作者: Koorosh Roohi,Javad Rajabi,Andrew Fleet,Babak Taati
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 pages, 9 figures. Project page: this https URL
Abstract:Photomosaics are large images whose local regions are seen as independent tiles while their overall arrangement forms a coherent scene. Generating them at high resolution, with every tile convincing in its own right, is computationally expensive, since the canvas must hold many detailed tiles at once. We present PhotoQuilt, a training-free framework that generates photomosaics at arbitrary resolution. Diffusion models struggle to satisfy both scales at once, as direct high-resolution generation is costly and tends toward one smooth image rather than a mosaic, while patch-based tiling keeps local detail but loses global structure. PhotoQuilt resolves this with a bootstrapped tiled denoising procedure. We first produce a global composition at low resolution to fix the layout, then upscale it in latent space and re-inject noise to restore generative capacity. Denoising proceeds within fixed tiles, so each forms its own image while the shared global structure holds them in one layout. Because tile generation is handled separately, PhotoQuilt scales to large canvases without quadratic attention cost. Experiments show that PhotoQuilt outperforms current baselines on both global structure and local realism.
[CV-139] Learning Where to Look: A Reinforcement Learning Framework for Robust Micro-Ultrasound Prostate Cancer Detection MICCAI2026
链接: https://arxiv.org/abs/2606.30951
作者: Mohammad Mahdi Abootorabi,Sina Namazi,Armin Saadat,Lyuyang Wang,Obed Dzikunu,Paul F. R. Wilson,Zhuoxin Guo,Brian Wodlinger,Parvin Mousavi,Purang Abolmaesumi
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Early Accept at MICCAI 2026 (top 9%)
Abstract:Micro-ultrasound ( \mu US) is a new, emerging, and promising imaging modality for prostate cancer (PCa) detection, but accurate identification of suspicious tissue remains highly dependent on clinical experience, leading to substantial inter-observer variability. Machine-learning assistance can reduce this variability; however, training reliable deep models is challenging because supervision is sparse and noisy – typically limited to core-level histopathology outcomes (e.g., cancer grade and its percentage in a biopsy core) without pixel-level lesion annotations and under severe class imbalance. We introduce Prost-RL, which reframes \mu US PCa detection as a spatially aware, policy-driven inference problem by learning where to look before decoding. Prost-RL integrates a lightweight reinforcement-learning policy into a foundation-model encoder-decoder to generate interpretable spatial attention maps that act as soft prompts for both cancer-likelihood heatmap prediction and image-level classification. We further propose Adaptive Policy Optimization (APO) to stabilize hybrid supervised-RL training and a noise-robust objective combining symmetric cross-entropy with negative-entropy regularization to mitigate weak-label noise and encourage sharp localization. On a cohort of 6,607 biopsy cores from 693 patients across five clinical sites, Prost-RL achieves 79.0\pm3.5 AUROC with 64.6\pm6.3 % sensitivity at 80% specificity for core-level detection (+2.1 AUROC and +4.5 sensitivity points over the strongest baseline), and 79.3\pm5.8 AUROC for clinically significant cancer classification. The learned policy highlights biopsy-aligned regions, providing transparent, spatially grounded evidence alongside quantitative risk predictions. Code is available at: this https URL.
[CV-140] No Adaptation Without Observation: Observability-Constrained Test-Time Prompt Tuning for LiDAR Semantic Segmentation IROS2026
链接: https://arxiv.org/abs/2606.30937
作者: Linlian Jiang,Wentao Ju,Sadman Rakib Pinon,Jianwei Xian,Zhixiang Chi,Xinxin Zuo,Yang Wang
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IROS 2026
Abstract:LiDAR semantic segmentation often degrades under real-world deployment due to evolving sensing conditions, while collecting new annotations for retraining is impractical. Test-time adaptation (TTA) updates model parameters online using pseudo-label supervision, but directly applying standard TTA strategies to LiDAR data is challenging. Because pseudo-label reliability is spatially heteroscedastic under range-dependent sparsity and occlusion, uniform updates on globally shared parameters can inject unstable gradients and destabilize adaptation. We propose a geometry-constrained test-time prompt tuning framework for LiDAR semantic segmentation. Our method estimates per-location sensing reliability from depth-consistent beam terminations and neighborhood support, and uses it to reweight spatial supervision. Adaptation is confined to lightweight prompt adapters inserted into a frozen backbone, with spatial gating to prevent unreliable regions from perturbing globally shared representations. A temporally smoothed prototype alignment strategy further stabilizes online updates by accumulating reliable semantic evidence over time. Experiments on standard LiDAR benchmarks demonstrate improved adaptation stability and segmentation performance under deployment variations without additional annotations.
[CV-141] GRAPE: Graph-Augmented Prototype Explanations for Interactive Medical Image Diagnosis
链接: https://arxiv.org/abs/2606.30901
作者: Rasul Khanbayov,Hasan Kurban
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Prototype-based medical image classifiers present three clinical limitations: they treat findings as independent, silently amplify unsafe physician feedback, and require full retraining whenever a new finding is needed. We present GRAPE (Graph-Augmented Prototype Explanations), a unified architecture that addresses all three challenges. First, a Graph Attention Task Head models anatomical concept co-occurrence, boosting macro-F1 by +13.8,pp over the prototype baseline on TBX11K. Second, a Concept-Mismatch Safety Check - the first such mechanism in prototype-based medical classifiers - warns when the model’s dominant finding inside a doctor-drawn region conflicts with the claimed label, catching 85% of erroneous annotations versus 51% for MC-Dropout with no extra inference cost. Third, Open-Vocabulary Prototype Anchoring aligns visual prototypes to clinical text, allowing a new finding to be added from a single labeled image without modifying any other component. On NIH ChestX-ray14, one Effusion example recovers full-supervision localization accuracy; on TBX11K, prototype maps achieve 2.6x better lesion localization than end-to-end baselines. All three capabilities add only +1~ms latency at interactive batch size. The project page is this https URL.
[CV-142] Knowledge-Driven Dimension Estimation from a Single Image -3D Asset Generation Technology for Digital Twin Construction
链接: https://arxiv.org/abs/2606.30896
作者: Hidenori Sakaniwa,Akihito Akai,Akihiko Hyodo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures
Abstract:In the verification of in-vehicle cameras, simulation technology using virtual spaces has advanced, enabling pre-evaluation of false detections and missed detections in various scenarios. However, discrepancies in the scale of the object being verified between the virtual and real environments can lead to a decrease in camera recognition performance. For traffic signs installed at high altitudes, distance measurement using LiDAR or stereo cameras is difficult, requiring size estimation from monocular images. This paper proposes a method for estimating the scale of an object by decomposing it into multiple structural elements and integrating external knowledge regarding design rules, geometric relationships, and conventional dimensions. Specifically, this method detects each component from a monocular image and estimates the size of each component by considering its structural relationships and dimensional consistency with surrounding elements. Furthermore, it generates a 3D asset of the object by reconstructing the estimated components. This method makes it possible to place 3D assets with a scale approximating the real environment within a digital twin space and is expected to contribute to improving the verification accuracy of in-vehicle cameras for autonomous driving in virtual environments.
[CV-143] he Label Imitation Game: Turing Test Network for Zero-Shot Pseudo-Label Pruning ECCV2026
链接: https://arxiv.org/abs/2606.30875
作者: Brent A. Griffin,Jason J. Corso
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ECCV 2026
Abstract:Foundation model pseudo-labeling - labeling data strictly via zero-shot inference - enables massive scale, but performance is undermined by hallucinations that evade standard thresholds. To eliminate these errors, we introduce the Turing-inspired Label Imitation Game (LIG), a framework that formalizes pseudo-label pruning as an adversarial interrogation. Rather than filtering labels via isolated thresholds, we use the LIG to train a Turing Test Network (TTN), a task-agnostic “judge” that evaluates candidate pseudo-labels within a dataset-wide context. Experiments across four diverse datasets demonstrate the TTN’s robustness, consistently enhancing label accuracy for three state-of-the-art vision-language models without costly supervision or retraining. Crucially, we demonstrate that learned semantic-contextual logic is a robust alternative to spatial-geometric verification, enabling a unique zero-shot task transfer capability - a TTN trained strictly on image classification datasets can effectively prune complex object detection pseudo-labels. This pruning yields F1-score gains of 28% for the worst-performing baseline categories and 44% with task-specific fine-tuning. Significantly, we also observe Category Revival, where the TTN pruning “detoxifies” the training signal for downstream models and enables them to recover from zero recall on transfer-vulnerable classes. The pre-trained TTN models and code are available at this https URL.
[CV-144] SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation ECCV2026
链接: https://arxiv.org/abs/2606.30849
作者: Juncheng Ma,Yuxuan Du,Yanan Sun,Zhening Xing,Changlin Li,Zhenyu Tang,Bo Li,Peng-Tao Jiang,Li Yuan,Daquan Zhou,Yonghong Tian
类目: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: ECCV 2026
Abstract:Diffusion Transformers (DiTs) have significantly advanced audio-driven portrait animation, but their high computational cost leads to substantial inference latency. Although training-free diffusion caching accelerates inference significant, existing methods are primarily developed for text-conditioned generation and overlook the spatial and modality imbalances inherent in audio-driven portrait animation. In this paper, we propose SyncCache, a training-free caching acceleration method tailored for DiT-based portrait animation that explicitly exploits asymmetric dynamics. Specifically, high-frequency dynamics driven by audio conditions and concentrated in human regions are more challenging and critical to cache and reuse than the low-frequency visual background in portrait animation. First, we introduce Spatially-Asymmetric Probing to prioritize error sensitivity in dynamic human region. Second, through Modality-Decoupled Caching, we bypass heavy DiT block by reusing stable inter-block residuals, while continuously recomputing lightweight audio blocks to preserve precise lip synchronization. Furthermore, we introduce a cache ratio to control cache capacity and formulate memory-adaptive cache selection as an offline dynamic programming problem without online overhead. Extensive experiments demonstrate that SyncCache achieves superior speed-quality trade-offs, delivering up to 4.12x acceleration on HunyuanVideo-Avatar and 3.75x on Wan-S2V with near-lossless visual fidelity and precise audio alignment.
[CV-145] AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation ECCV2026
链接: https://arxiv.org/abs/2606.30811
作者: Kien T. Pham,I Chieh Chen,Qifeng Chen,Long Chen
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: ECCV 2026
Abstract:Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. The preceding methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap while necessitating intensive computational resources for proper training. Inspired by recent advancements in one-dimensional visual tokenization, we present \textbfAVTok, a novel unified tokenizer designated for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with shared encoder-decoder and modal-specific learnable queries to efficiently and effectively encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy to progressively realize reconstruction capabilities for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for the challenge of joint audio-video tokenization and provides a potential direction to build unified large multimodal models for audio-video generation.
[CV-146] GaussLite: Online Task-Conditioned 3D Gaussian Splatting for Real-Time Robotic Mapping
链接: https://arxiv.org/abs/2606.30809
作者: Annika Thomas,Mason Peterson,Jonathan P. How
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Existing 3D Gaussian Splatting (3DGS) systems distribute representation capacity uniformly across a scene, ignoring the fact that many downstream robotic tasks engage only a fraction of the reconstructed geometry. This causes valuable onboard compute to be allocated towards optimizing irrelevant parts of the scene, either limiting online capacity or under-optimizing the most relevant parts of the scene. We introduce GaussLite, a task-driven 3DGS mapping system that conditions its representation density on a natural-language task specification. Given a posed RGB-D stream and a task such as “prepare to pick up the object on the desk,” GaussLite uses a one-shot LLM parser to extract target and anchor objects, which are grounded per-frame by an open-vocabulary detector and segmented to produce per-pixel relevance masks in real time. The mapper allocates seeding density, gradient flow and scaling by task relevance. At matched Gaussian budget and real-time mapping at 4 Hz on resource-constrained hardware, GaussLite outperforms baselines on ROI PSNR on the Replica Dataset by an average +2.72 dB and on a real-hardware demonstration in indoor and outdoor settings by +2.23 dB. We further show that two task-specialized agents’ maps can be fused into a single shared map via per-voxel voting on active-optimization counts in real time, outperforming concatenation by +3.42 dB while only sharing an average 7.08% of the map.
[CV-147] Off the Rails: Hijacking the Scoring Head in Generative End-to-End Driving Planners with Safety-Violating Adversarial Perturbations
链接: https://arxiv.org/abs/2606.30807
作者: Halima Bouzidi,Mboutidem Ekemini Mkpong,Haoyu Liu,Mohammad Abdullah Al Faruque
类目: Robotics (cs.RO); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 4 figures, 9 tables
Abstract:Generative models have recently seen rapid adoption in End-to-End (E2E) autonomous driving (AD), with diffusion-based denoising and vocabulary-based retrieval becoming the dominant trajectory-decoding paradigms. Despite their architectural diversity, current generative AD planners share a common inference pattern: a fixed set of candidate trajectories (anchors, vocabulary entries, or proposal queries) is scored by one or more learned heads conditioned on the Bird’s-Eye-View (BEV) features, and the highest-scored candidate is returned as the final trajectory. Under this design, the scoring head is the only barrier between perception and the motion command, and its decision margins between competing candidates are often small. We introduce \textscDerail, an adversarial framework that exploits this scoring-head attack surface. Evaluated on various generative planners, \textscDerail flips the trajectory selection from a safe to an unsafe candidate, with score drops of 39 – 80% and collision rates of up to 50% , consistently outperforming generic loss-maximization and feature-divergence attacks. Our analysis suggests that safety-violating objectives govern attack effectiveness against generative AD planners, and that the scoring-head inference pattern itself is a recurring attack surface worth explicit defensive consideration.
[CV-148] Simple Supervision Is Hard to Beat: A Bitter Lesson from Sparse Target Labels in Domain-Adaptive Object Detection
链接: https://arxiv.org/abs/2606.30795
作者: Lijun Zhang,Ruinian Xu,Mudit Agrawal
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Source-free domain adaptive object detection adapts a source-trained detector to an unlabeled target domain, typically through teacher-student self-training with pseudo-labels. We revisit this setting when a small, uniformly sampled subset of target images is labeled. We introduce Random-Target Supervised Mixing (RTSM), a simple anchor that incorporates these annotations through a supervised detection loss while leaving the original unlabeled adaptation branch unchanged. Across evaluations spanning four SFDA-OD methods, two object detectors, multiple adaptation tasks, and target-label budgets from 1% to 10%, RTSM consistently improves pure SFDA by 1.7 to 18.3 AP50. We then examine whether the same annotations can provide further gains by steering unlabeled self-training. To this end, we evaluate ten sparse-label feedback plugins covering pseudo-label selection, object completion, and optimization control, which yield limited and method-dependent gains over RTSM. These results reveal a bitter lesson for sparse-label SFDA-OD: simple supervision is hard to beat. RTSM therefore provides a simple yet effective anchor for sparse-label SFDA-OD.
[CV-149] Unveiling Transferability in Trajectory Prediction via Latent Scene Embeddings ECCV2026
链接: https://arxiv.org/abs/2606.30777
作者: Theodor Westny,David Axelsson,Björn Olofsson,Erik Frisk
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ECCV 2026
Abstract:The growing availability of trajectory datasets has fueled major advances in data-driven motion prediction. Yet, models trained on one dataset often fail to generalize beyond their training domain as a result of differences in scene layouts, agent behaviors, and sensing conditions. A framework that learns latent representations of datasets and quantifies their similarity using distributional metrics is presented. This large-scale study covers 24 major datasets, including the most widely used motion-prediction benchmarks, and shows that the resulting transferability scores strongly correlate with cross-dataset model performance. The results provide practical guidance for dataset selection, pretraining, and large-scale foundation models for motion prediction, paving the way toward more generalizable and robust predictive systems.
[CV-150] Streaming Gaussian Encoding for 4D Panoptic Occupancy Tracking IROS2026
链接: https://arxiv.org/abs/2606.30754
作者: Maximilian Luz,Thomas Nürnberg,Yakov Miron,Abhinav Valada
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to the 2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2026)
Abstract:Camera-based 4D panoptic occupancy tracking (4D-POT) is a promising paradigm for holistic scene understanding from multi-view imagery, enabling joint reasoning about geometry, semantics, and object identities across time. Recent mask-based pipelines achieve strong performance by propagating instance queries across frames. However, their underlying volumetric representations are typically recomputed at each timestep, limiting geometric temporal consistency, particularly under occlusion and for static scene elements. To address this limitation, we propose a streaming Gaussian encoder that maintains a persistent volumetric scene representation for 4D-POT. Our method models the scene as a fixed-size set of latent Gaussian queries that are propagated via ego-motion compensation and refreshed under a confidence-guided budget constraint. Crucially, we shape Gaussian opacities through depth-based supervision to serve as proxy for visibility, enabling confidence to accumulate as a temporally aggregated measure of persistent scene support. Together with a warmup-based multi-frame training strategy, this yields representation-level temporal coherence beyond decoder-only tracking. Extensive experiments on Occ3D-extended nuScenes and Waymo establish a new state-of-the-art for camera-based 4D-POT, improving tracking consistency with negligible computational overhead while remaining fully compatible with existing mask-based pipelines. We provide code and models at this https URL.
[CV-151] LUMOS: A Semantic Operating-System Layer for Accessibility-Grounded AI Agents
链接: https://arxiv.org/abs/2606.30697
作者: Yogeswar Reddy Thota
类目: Operating Systems (cs.OS); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: The LUMOS repository is available at this https URL
Abstract:Current operating systems expose interfaces optimized for human users but not for AI agents. Humans benefit from pixels, icons, windows, visual grouping, mouse movement, and keyboard shortcuts; AI agents instead need compact semantic state, grounded actions, and reliable feedback. As a result, many computer-use agents are forced to interpret screenshots, OCR output, and visual crops, introducing high token costs, visual ambiguity, latency, and coordinate uncertainty. This paper introduces LUMOS (Language Model Unified Machine-Readable Operating-System Semantics), a semantic interaction layer between AI agents and operating systems. LUMOS converts native accessibility metadata and browser UI structures into machine readable semantic blueprints with stable identifiers, roles, names, values, bounds, and action affordances. It also supports live semantic pointer grounding by querying the UI element under or near the cursor through operating-system automation APIs. An LLM then acts through an accessibility grounded observe act loop using constrained visible-UI primitives rather than application-specific scripts. LUMOS does not claim to replace visual agents; instead, it reduces dependence on screenshots when operating systems already provide semantic structure. These results suggest a path toward AI-native operating systems and machine-readable interaction layers.
[CV-152] DANTE-W: Diffuse Albedo Neural Texturing in the Wild
链接: https://arxiv.org/abs/2606.30677
作者: Guangyu Wang,Tianheng Lu,Ruqi Huang,Lu Fang
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Classical mesh texturing techniques blend captured multi-view images directly, which inevitably suffer from baked-in shading and casted shadows that compromise visual fidelity during relighting. To circumvent this issue, we present a neural texturing framework, namely DANTE-W, to enable high-fidelity diffuse albedo texture recovery from unstructured image collections for large-scale, in-the-wild scenes, which integrates seamlessly with traditional 3D reconstruction pipelines. Given a reconstructed mesh and its surface parameterization, our method fuses view-space generative albedo priors into a coherent texture space via an expressive neural representation, while substantially enhancing fine-grained textural details through physically principled neural rendering. To comprehensively evaluate our method, we curate a benchmark dataset featuring diverse, fine-grained textures, comprising both real-world in-the-wild scenes and synthetic objects. Extensive experiments verify the effectiveness of our approach in reconstructing accurate albedo textures and boosting relighting fidelity. Project page: this http URL.
[CV-153] PolyFlow: Continuous Topology Embedding Flow Matching for Artist-style Mesh Generation
链接: https://arxiv.org/abs/2606.30673
作者: Chunshi Wang,Haohan Weng,Junliang Ye,Biwen Lei,Yang Li,Zibo Zhao,Zeqiang Lai,Kaiyi Zhang,Yunhan Yang,Zhuo Chen,Chunchao Guo,Yawei Luo
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autoregressive Transformers dominate high-quality mesh generation by producing artist-worthy topologies, yet their inherent sequential decoding induces substantial computational overhead, falling orders of magnitude slower than parallel generative models. On the other hand, while continuous diffusion and flow-matching methods support efficient parallel synthesis across a variety of domains, they cannot be directly applied to meshes: mesh connectivity is inherently discrete and incompatible with standard continuous noise injection and denoising operations. To resolve this fundamental incompatibility, we introduce a compact topology embedder that projects discrete mesh vertex positions and normals into continuous per-vertex embeddings, where the original discrete adjacency information can be faithfully recovered via spacetime distance thresholding. After pretraining and freezing this embedder, any raw mesh can be fully converted into a continuous per-vertex state space unifying position, normal, and implicit topological attributes. Built upon this novel continuous mesh representation, we present PolyFlow, a Transformer-based flow-matching framework that achieves fully parallel vertex state denoising conditioned on extracted point-cloud features. During inference, our model completes generation rapidly via an ODE solver, and supports explicit, precise control over output mesh resolution by directly specifying the target vertex count. Extensive evaluations on the Toys4K benchmark demonstrate that PolyFlow surpasses state-of-the-art autoregressive baselines in both Chamfer Distance and Hausdorff Distance.
[CV-154] Cross-Modal Hierarchical Fusion for from Multi-Sensor Ground Observation
链接: https://arxiv.org/abs/2606.30647
作者: Xinze Zhang
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Dense volumetric reconstruction of cloud microphysical fields from sparse ground-based instruments remains an open problem, largely because the available measurements are heterogeneous in both modality and spatial coverage. We present AtmoFuseNet, a framework that fuses multi-view sky camera imagery with millimeter-wave cloud radar and ceilometer observations to produce 4D (three spatial dimensions plus time) estimates of cloud state and wind. The method operates in three stages: a cross-modal hierarchical aggregation module that combines image feature pyramids with instrument-derived vertical profiles through layer-wise cross-attention; a conditional variational refinement module that maps the resulting volume to physically consistent microphysical fields under differentiable radar and image forward models; and a correlation-based motion estimator that recovers per-voxel 3D wind vectors from consecutive volumetric reconstructions. On collocated observations from a semi-arid site, AtmoFuseNet reaches 0.026 g m^-3 liquid water content MAE and 1.18 m s^-1 wind speed MAE, improving over existing retrieval baselines. Ablation experiments isolate the contribution of each module.
[CV-155] Distortion-Corrected Diffusion MRI Using Rotated-View EPI and Joint Field-Map/Image Estimation with Gaussian Primitives
链接: https://arxiv.org/abs/2606.31521
作者: Wenqi Huang,Zhitao Li,Nan Wang,Yimeng Lin,Mengze Gao,Yurui Qian,Sevgi Gokce Kafali,Xiaozhi Cao,Kawin Setsompop,Daniel Rueckert,Congyu Liao
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
备注:
Abstract:Echo Planar Imaging (EPI) is the standard acquisition technique for diffusion and functional neuroimaging, enabling rapid imaging but suffering from geometric distortions caused by B0 field inhomogeneities. Existing correction methods first reconstruct distorted images using parallel imaging, then estimate the B0 field and correct the distortion in the image domain. In this sequential process, reconstruction artifacts at high acceleration factors and low SNR at high diffusion b-values degrade B0 estimation and limit the overall correction quality. We propose a physics-informed framework that jointly estimates the B0 field and distortion-free image directly from k-space data, without depending on an intermediate parallel-imaging reconstruction for the correction. The image and the B0 field are each represented as a superposition of Gaussian primitives embedded within an MRI physics forward model. The explicit, continuous parameterization captures both smooth regions and tissue boundaries and supports rotated-view EPI acquisitions without interpolation. The diffusion-weighted image is modeled as real and non-negative, with the image phase absorbed into a per-shot phase factor. Rotated views distribute distortions across multiple phase-encoding orientations, improving point spread function isotropy and providing stronger constraints for B0 estimation. On in vivo brain diffusion EPI, the proposed method attains the closest brain-boundary agreement with a distortion-free structural reference, with the largest improvement over sequential methods at high b-value and high acceleration. Extensive visual comparisons further show improved detail fidelity and noise suppression.
[CV-156] Accelerating Merge with Motion Vector Difference via Filter Difference Analysis for VVenC
链接: https://arxiv.org/abs/2606.31084
作者: Xinmin Feng,Shengyang Xu,Jianhua Chen,Li Li,Dong Liu,Feng Wu
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 tables, 4 figures
Abstract:Merge with Motion Vector Difference (MMVD) is a key coding tool in Versatile Video Coding for improving motion prediction accuracy. However, its exhaustive search strategy imposes a significant computational burden on the encoder. To address this issue, we propose a novel fast MMVD algorithm for the VVenC encoder based on fractional motion vector filter difference analysis. By approximating the 8-tap interpolation filter with a 2-tap filter, we derive a criterion based on spatial gradients and prediction residuals for estimating the potential gain of MMVD candidates. We further generalize this criterion to accommodate both shifted integer reference samples and 2D separable filtering. To minimize the overhead of the proposed method, we introduce implementation optimizations, including symmetric offset inference and cross-shaped downsampled dot-product computation. Compared with existing fast MMVD algorithms in VVenC, our method reduces the average MMVD search ratio from 21.07% to 11.05% and decreases the efficiency-complexity metric \eta from 11.79 to 7.10 under the fast preset.
人工智能
[AI-0] Freeform Preference Learning for Robotic Manipulation
链接: https://arxiv.org/abs/2606.32027
作者: Marcel Torne,Anubha Mahajan,Abhijnya Bhat,Chelsea Finn
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Reward design remains a central bottleneck for autonomous robot policy improvement, especially in long-horizon manipulation tasks where sparse success labels provide too little signal and binary preferences collapse many competing notions of quality into one ambiguous signal. We introduce Freeform Preference Learning (FPL), a method for learning robot policies from freeform human preferences. Rather than asking annotators which of two trajectories is better overall, FPL lets them define natural-language preference axes, such as speed, safety, quality of placement, or carefulness, and provide pairwise preferences along each axis. These annotations are used to learn a language-conditioned reward model that maps a trajectory and preference label to an axis-specific reward. We use this model to train a reward-conditioned policy that optimizes across the multiple human-specified dimensions. Across four real-world and two simulated long-horizon manipulation tasks, FPL improves over sparse-reward and binary-preference methods by 38 percentage points. Beyond improved performance, FPL learns dense progress signals without explicit subtask segmentation, shows compositionality of behavior not present in the data, and allows users to steer the policy towards different behaviors at test time without retraining. Blog post with videos available at this https URL
[AI-1] AdaJEPA: An Adaptive Latent World Model
链接: https://arxiv.org/abs/2606.32026
作者: Ying Wang,Oumayma Bounou,Yann LeCun,Mengye Ren
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Latent world models enable planning from high-dimensional observations by predicting future states in a compact latent space. However, these models are typically kept frozen at test time: when their predictions become inaccurate, planning can fail, especially under test-time distribution shift. To address this, we propose AdaJEPA, an adaptive latent world model that performs test-time adaptation within the closed loop of model predictive control (MPC). After training, AdaJEPA plans and executes the first action chunk, uses the observed next-state transition as a self-supervised adaptation signal, and replans with the updated model. This closed-loop update continuously recalibrates the world model without additional expert demonstrations. Across a range of goal-reaching tasks, AdaJEPA substantially improves planning success with as few as one gradient step per MPC replanning step.
[AI-2] RIAGE: Role-Typed Credit Assignment for Agent ic Reinforcement Learning
链接: https://arxiv.org/abs/2606.32017
作者: Yuanda Xu,Zhengze Zhou,Hejian Sang,Xiaomin Li,Jiaxin Zhang,Xinchen Du,Zhipeng Wang,Alborz Geramifard
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic reinforcement learning requires assigning credit to environment-facing actions such as searches, clicks, edits, navigation commands, and object interactions. Standard GRPO uses the final verifier outcome as a uniform advantage over all action tokens. This outcome signal is useful but structurally incomplete: it punishes useful exploration in failed rollouts and reinforces redundant or regressive actions in successful rollouts. We propose TRIAGE, a role-typed credit assignment framework that adds a semantic role axis to outcome credit. A structured judge classifies each segment as decisive progress, useful exploration, no-progress infrastructure, or regression, and a fixed role-conditioned rule maps these labels to bounded segment-level process rewards. This keeps verifier outcomes as the source of optimization direction while correcting the two main blind spots of outcome-only credit. We further show that role-conditioned credit is the optimal segment-level correction expressible from role labels alone – a projection of the per-segment advantage residual onto the role variable – so that the fixed role constants reduce advantage estimation error whenever the judge is reliable, and we connect this to lower-variance policy gradients. Across ALFWorld, Search-QA, and WebShop, TRIAGE improves success rates over GRPO for two policy models and outperforms both a scalar judge-derived process reward and an outcome-supervised shared-backbone value baseline. Ablations show that the gain comes from role typing rather than merely adding dense rewards: reliable detection of regression inside successful trajectories is the dominant contributor, while exploration credit provides a consistent secondary gain; on completed ALFWorld and WebShop rollouts, TRIAGE also reduces environment-facing turns by an additional 10.4% and 14.8% relative to GRPO.
[AI-3] AxDafny: Agent ic Verified Code Generation in Dafny
链接: https://arxiv.org/abs/2606.32007
作者: Benjamin Breen,Austin Letson,Borja Requena Pozo,Leopoldo Sarra
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We study agentic code generation in Dafny, where a model must generate both executable code and the proof artifacts for verification. We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments. We also introduce LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny), a benchmark of 250 competition-style programming problems translated into Dafny with formal specifications and a verifier-based evaluation harness. On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance. On DafnyBench, AxDafny achieves 92.7% verification success, outperforming the strongest previously reported proof-hint baseline by 6.5 percentage points. Lastly, we show that verification success and runtime test performance measure different aspects of generated code.
[AI-4] PolicyGuard: From Organizational Policies to Neuro-SymbolicCompliance Review Engines
链接: https://arxiv.org/abs/2606.32004
作者: Sameer Malik,Ayush Singh,Amar Prakash Azad
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Symbolic Computation (cs.SC)
备注:
Abstract:Policy-grounded document review requires determining whether a target document complies with organization-specific policies, guidelines, or playbooks. While large language models can assist with policy interpretation and document analysis, end-to-end prompting leaves the applied policy logic implicit, making compliance decisions difficult to inspect, update, and test. We present PolicyGuard, a neuro-symbolic framework for policy-grounded document compliance review. PolicyGuard converts organizational policy guidance into an executable review engine consisting of typed relational logic rules and atom-level extraction questions. During review, LLMs answer these local questions using retrieved document evidence, and a symbolic evaluator applies the formal rules to detect non-compliance. We instantiate and evaluate PolicyGuard on company-specific NDA compliance review, where contract clauses must be checked against organization-specific negotiation policies. By separating policy formalization, local document interpretation, and symbolic compliance evaluation, PolicyGuard makes document review more explicit, maintainable, and systematically testable.
[AI-5] Self-Study Reconsidered: The Hidden Frag ility of Learning from Self-Generated QA
链接: https://arxiv.org/abs/2606.32002
作者: Ekaterina Alimaskina,Denis Shveykin,Gleb Molodtsov,Igor Shalygin,Alexey Kadeishvili,Aleksandr Beznosikov
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Language models are increasingly taught from synthetic question–answer (QA) supervision: a model generates questions about a document, answers them from the same text, and the resulting pairs are used to fine-tune, distill, or compress knowledge into another model. We show that this generation step is not neutral preprocessing. It is an implicit policy that both selects which evidence becomes training signal and decides how that evidence is answered, and it is fragile at both stages. When choosing what to ask, generators do not scan a document uniformly. Coverage saturates early and concentrates on salient spans, diverse prompts converge on the same regions, and what looks question-worthy is driven by local presentation. As a result, salient artifacts such as poorly cleaned markup can hijack question generation across model families and scales. When answering, the model that produces the supervision tends to obey instruction-like passages embedded in the text. This compliance depends on the intent and surface form of the passage rather than its strictness, and is worst under task conflict, where larger models comply more often. These failure modes arise from choices made during QA generation, so they can be reduced without changing the training loop. Tying each question to a fixed target reduces biased selection, and filtering instruction-like spans before answering lowers mean injection compliance from 88% to 13% in our evaluation while retaining nearly all clean text.
[AI-6] Radial Suppression Accelerates Algorithmic Generalization: A Geometric Analysis of Delayed Generalization ICML2026
链接: https://arxiv.org/abs/2606.32000
作者: Srijan Tiwari,Aditya Chauhan,Manjot Singh
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures, 10 tables. Presented at the Workshop on High-dimensional Learning Dynamics at the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Why do neural networks memorize algorithmic training data long before they generalize? We present a geometric case study demonstrating that, on tasks where generalization requires discovering structured low-dimensional circuits, the memorization-generalization delay is driven by radial inflation of hidden representations under cross-entropy optimization. We formalize a radial-angular decomposition of activation-space dynamics and derive three testable propositions: (i) that penalizing radial inflation induces anisotropic, data-dependent weight regularization; (ii) that it suppresses radial gradient energy below the isotropic random baseline, forcing predominantly angular updates; and (iii) that it biases convergence toward flatter minima. To empirically validate these propositions, we study a single-hyperparameter norm penalty that softly constrains activations to a sqrt(d)-radius hypersphere. On modular arithmetic, this penalty accelerates grokking up to 6x across MLPs and Transformers, and halves training steps for a 10M-parameter nanoGPT on 3-digit addition.
[AI-7] Amplifying Membership Signal Through Chained Regeneration
链接: https://arxiv.org/abs/2606.31991
作者: Wojciech Łapacz,Stanisław Pawlak
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The tendency of large generative models to memorize training data makes sample verification critical for privacy auditing and copyright enforcement. Current membership (MIA) and dataset inference (DI) attacks often rely on one-shot generations, which yield weak signals and limited sensitivity across modalities. Inspired by Model Autophagy Disorder (MAD), we introduce MADreMIA, a model-agnostic framework that enhances white-, gray-, and black-box MIA and DI. Rather than relying on shadow model training – often infeasible for large generative models – our framework facilitates scalable inference by leveraging inherent signals through iterative trajectories. This process utilizes chained generations across diverse modalities, where each output serves as the subsequent input, to improve membership evidence at low FPR. We demonstrate that memorized training samples exhibit significantly higher coherence and slower degradation during iterative regeneration than non-member generations. Our results show that MADreMIA provides richer signals across diverse model families and modalities; we present comprehensive evaluations for IARs, diffusion, and language models, alongside preliminary results demonstrating its potential for audio models.
[AI-8] LeCropFollow: Latent Space Planning for Navigation in Unstructured Crop Fields
链接: https://arxiv.org/abs/2606.31941
作者: Felipe Tommaselli,Francisco Affonso,Arthur Pompeu,Gianluca Capezzuto,Arun Narenthiran Sivakumar,Girish Chowdhary,Marcelo Becker
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 7 figures, 3 tables. Github Repo: this https URL
Abstract:Unstructured navigational features, such as irregular planting or discontinuities, remain the primary failure mode for under-canopy agricultural robots. Existing geometric approaches often fail in these scenarios because they compress high-dimensional visual data into deterministic spatial references, effectively discarding the uncertainty and semantic context required to navigate ambiguous terrain. To address this, we present LeCropFollow, a visual navigation framework that bypasses explicit geometric modeling in favor of a learned latent representation. By integrating a self-supervised semantic heatmap extractor with TD-MPC2, a Model-Based Reinforcement Learning (MBRL) planner, our system optimizes trajectories directly within a latent manifold. The framework operates over the uncompressed heatmap signal, preserving the semantic context that geometric reductions discard. We demonstrate that this representational shift enables zero-shot transfer from simplified simulation to the physical world without fine-tuning. Extensive field experiments in late-stage corn fields show that LeCropFollow matches state-of-the-art baselines in unstructured rows but significantly outperforms them in plantation gaps, achieving a 2.4x reduction in semantic failures compared to keypoint-based methods. These results suggest that latent planning offers a robust alternative to geometric estimation for operations in heterogeneous agricultural environments. Code, models, and data available: this https URL .
[AI-9] Better Understanding Understanding Better
链接: https://arxiv.org/abs/2606.31892
作者: Yu Wei(Department of Philosophy, East China Normal University)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:“Any fool can know; the point is to understand.” A well-known remark often attributed to Einstein captures a widely shared intuition: understanding is more than merely knowing. Yet epistemic logic has paid relatively little attention to understanding, despite its central role in contemporary epistemology, philosophy of science, and recent debates about AI. A recurring theme in the philosophical literature is that, unlike knowledge, understanding comes in degrees: one may understand something more or less well, and one’s understanding may be better than another’s. We introduce a comparative epistemic logic of understanding with level-indexed understanding modalities and a comparative connective for saying that one agent understands why a proposition better than another agent does. Semantically, we enrich multi-agent epistemic models with agent-indexed graded explanation structures and a justification-style term algebra. This yields a unified framework for representing minimal, ordinary, more demanding, and ideal understanding, together with comparisons between agents with respect to the same formula at issue. We distinguish a finitary bounded-level calculus from an infinitary full-language companion system. We establish soundness and strong completeness, and show that each fixed finite-level fragment is decidable.
[AI-10] Modal CEGAR-tableaux with RECAR and resolution-based SAT-shortcuts
链接: https://arxiv.org/abs/2606.31878
作者: Rajeev Goré(Faculty of Information Technology, Monash University, Australia),Cormac Kikkert(Cormac Kikkert Research)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:We investigate two approaches for extending CEGAR-tableaux with SAT-shortcuts using a previously known approach called RECAR but also a totally new approach using the modal resolution theorem prover KSP as an oracle. Our experiments using our C++ implementation CEGARBox++ of CEGAR-tableaux show that: (1) CEGARBox++ with RECAR SAT-shortcuts is not competitive (2) CEGARBox++ using KSP to provide SAT-shortcuts is superior to both CEGARBox++ and KSP, particularly on large satisfiable problems. As far as we know, this is the first effective integration of SAT, tableaux and resolution methods for modal satisfiability which performs better than its parts. Comments: In Proceedings AiML 2026, arXiv:2606.29444 Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI) ACMclasses: F4.1; I.24; I.2.3 Cite as: arXiv:2606.31878 [cs.LO] (or arXiv:2606.31878v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2606.31878 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: EPTCS 447, 2026, pp. 427-444 Related DOI: https://doi.org/10.4204/EPTCS.447.24 Focus to learn more DOI(s) linking to related resources Submission history From: EPTCS [view email][via EPTCS proxy] [v1] Tue, 30 Jun 2026 15:58:08 UTC (1,021 KB) Full-text links: Access Paper: View a PDF of the paper titled Modal CEGAR-tableaux with RECAR and resolution-based SAT-shortcuts, by Rajeev Gor’e (Faculty of Information Technology and 3 other authorsView PDFTeX Source view license Current browse context: cs.LO prev | next new | recent | 2026-06 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); We gratefully acknowledge support from our major funders, member institutions, , and all contributors. About Help Contact Subscribe Copyright Privacy Accessibility Operational Status (opens in new tab) Major funding support from
[AI-11] Belief Contraction in Dynamic Epistemic Logic
链接: https://arxiv.org/abs/2606.31861
作者: Gaia Belardinelli(Stanford University),Snow Zhang(University of Berkeley, California)
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: In Proceedings AiML 2026, arXiv:2606.29444
Abstract:Dynamic epistemic logic represents belief change via model transformations induced by epistemic events. Its standard formulation (Baltag, Moss, Solecki, 1998) provides a natural account of belief expansion through the elimination of possibilities, but it cannot model belief contraction about factual propositions. A classic response enriches Kripke models with plausibility orderings, representing contraction as an update that promotes certain possibilities over others. We show that this approach has expressive limitations. In particular, the approach cannot model belief that violates positive introspection and contraction dynamics in response to a hedged public announcement that phi might be false. Motivated by these considerations, we introduce a mechanism for belief contraction defined directly on standard Kripke models, without any constraints on the doxastic accessibility relation. We show that it satisfies some of the standard properties of belief contraction but not others, study the conditions under which contraction may be unsuccessful, and provide a sound and complete axiomatization of the logic via reduction axioms. We also define a more general dynamic logic that is an extension of standard DEL and accommodates belief contractions due to events such as private or semi-private announcements, and provide a complete and sound axiomatization of the general logic. Comments: In Proceedings AiML 2026, arXiv:2606.29444 Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.31861 [cs.LO] (or arXiv:2606.31861v1 [cs.LO] for this version) https://doi.org/10.48550/arXiv.2606.31861 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: EPTCS 447, 2026, pp. 137-157 Related DOI: https://doi.org/10.4204/EPTCS.447.8 Focus to learn more DOI(s) linking to related resources
[AI-12] Z-1: Efficient Reinforcement Learning for Vision-Language-Action Models
链接: https://arxiv.org/abs/2606.31846
作者: Lang Cao,Renhong Chen,Luyi Li,Peng Wang,Mofan Peng,Yitong Li
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language-Action (VLA) models offer a promising framework for robotic manipulation by connecting language instructions, visual observations, and continuous control. However, most existing policies remain limited by behavior cloning or supervised fine-tuning (SFT) from fixed demonstrations, which provides limited opportunity to improve from the policy’s own failures. In this paper, we present Z-1, a reinforcement learning (RL) post-training framework for flow-based VLA models. Built on top of \pi_0.5 , Z-1 uses only publicly released RoboCasa demonstrations for SFT and then applies a task-wise Group Relative Policy Optimization (GRPO) strategy across 24 standard RoboCasa tasks. To improve the efficiency and stability of online optimization, Z-1 combines shared-prefix rollout construction, tree-structured trajectory branching, completion-aware reward calibration, and selective joint training of VLM and Action Expert. Across all 24 RoboCasa tasks, Z-1 achieves an average success rate of 80.6% , improving over its SFT initialization by 13.2% points and outperforms the published sota models. These results show that systematic GRPO post-training can substantially improve flow-based VLA policies without additional private demonstrations.
[AI-13] Bridging Local Observation and Global Simulation in Closed-Loop Traffic Modeling
链接: https://arxiv.org/abs/2606.31844
作者: Ziyan Wang,Tan Xiang,Peng Chen,Xintao Yan
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:A local-to-global context mismatch arises when autoregressive traffic simulators trained on ego-centric driving logs are deployed in globally observable closed-loop environments. In such logs, the ego vehicle has rich local observations, while surrounding agents are only partially observed due to perception limits and occlusions. As a result, simulators may learn incomplete context–action mappings that remain hidden in log-based training but emerge during closed-loop rollouts, leading to unrealistic behaviors such as abnormal stops, unsafe interactions, and rule violations. We propose CRAFT, a Contextual pReference Alignment Framework for Traffic Simulation, to mitigate this mismatch via self-supervised failure discovery and preference-guided test-time alignment. CRAFT treats the base simulator as a globally observable sandbox, generating diverse what-if rollouts from logged initial states to expose context-induced failures. These failures are grounded with human-aligned driving priors and converted into preference supervision for training a Contextual Preference Evaluator (CPE). At inference time, CPE acts as a plug-in alignment module that scores candidate actions under complete scene context and reweights autoregressive decoding toward globally coherent behaviors. CRAFT mitigates this local-to-global contextual bias, reducing collisions by 31.2% and traffic violations by 33.2% without retraining the base simulator.
[AI-14] An Agent ic AI Framework to Accelerate Scientific Discovery in Plant Phenotyping
链接: https://arxiv.org/abs/2606.31831
作者: Renan Souza,Daniel Rosendo,Kelsey Carter,John Lagergren,Frédéric Suter,Shelaine L. Curd,Gerald A. Tuskan,Rafael Ferreira da Silva,David Weston
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:High-throughput plant phenotyping now generates image derived datasets far faster than scientists can analyze them. At Oak Ridge National Laboratory’s Advanced Plant Phenotyping Laboratory (APPL), automated stations image hundreds of plants daily across multiple remote sensing modalities; yet, trait extraction and interpretation remain manual, expert-bound, and strictly post-hoc, making analysis, not acquisition, the binding constraint on discovery. We present an end-to-end agentic AI framework that turns the facility from a data factory into an interactive autonomous, discovery platform, where scientists partner with AI agents to accelerate time to insight. A conversational Co-Scientist Agent translates a scientist’s natural-language question into a structured analysis plan, and a headless Compute Agent dispatches Vision Transformer segmentation and trait extraction on the Frontier exascale supercomputer. The two agents run in separate security and resource domains and communicate over a secure, token-authenticated streaming channel, a design that accounts for the federation, data-movement, and provenance realities cloud-native agentic frameworks ignore, ensuring end-to-end provenance is captured for every interaction. The framework turns a days- to weeks-long analysis process into an interactive loop where agents reason over results, recommend next analyses, and respond to follow-up questions in seconds.
[AI-15] Adaptive Cluster-First Route-Second Decomposition for Industrial-Scale Vehicle Routing
链接: https://arxiv.org/abs/2606.31820
作者: Oguzhan Karaahmetoglu(1),Hyong Kim(1) ((1) Carnegie Mellon University)
类目: Artificial Intelligence (cs.AI)
备注: 29 pages, 6 figures, 5 tables
Abstract:Large-scale capacitated vehicle routing problems (CVRPs) are commonly addressed using cluster-first route-second (CFRS) approaches that split a routing instance into smaller, computationally tractable subproblems. Existing splitting methods typically rely on fixed partitioning rules, predefined optimization objectives, or learned policies, which may perform inconsistently across instances exhibiting different spatial, demand, and operational characteristics. In this work, we propose an adaptive CFRS system that formulates a decomposition procedure as an iterative decision-making process. Motivated by the recent success of large language models (LLMs) in reasoning and tool selection, the system employs an LLM as a high-level decision maker that analyzes the evolving decomposition state and selectively applies further clustering, balancing, and refinement operators. The proposed algorithm jointly partitions customers and vehicles, enabling capacity-aware clustering while adapting partitioning decisions to the characteristics of each problem. We evaluate the approach on synthetic and benchmark-derived CVRP instances containing up to 500,000 customers. Experimental results demonstrate competitive performance on benchmark-scale instances while exhibiting improved scalability and robust routing quality on substantially larger problems. These results highlight the potential of adaptive, LLM-guided decision support as a practical approach for industrial-scale vehicle routing and large-scale logistics planning.
[AI-16] Creating Intelligence: A Computational Foundation for AGI
链接: https://arxiv.org/abs/2606.31819
作者: Peter Overmann
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This work introduces a new computational theory of mind grounded in set theory and hyperdimensional computing. Whereas traditional neural networks rely on continuous weights and matrix multiplication, this framework works with sparse binary data. It represents information as discrete sets, directly modeling biological neural population codes. I demonstrate that associative memory emerges naturally from network topologies featuring a combinatorially expanded hidden layer. Learning is driven by topological plasticity rather than scalar weight adjustments. This architecture unifies auto-associative and hetero-associative learning under a single core algorithm: information retrieval via subset pattern matching and exact nearest-neighbor search. Operating with constant-time complexity, these mechanisms bridge perceptual data (sparse distributed representations) and symbols (sparse holographic representations) without continuous bottlenecks. Mapping this framework to neuroanatomy, I propose that both the cerebellum and the neocortex implement variants of this algorithm, making subset pattern matching the fundamental engine of cognition. Because it relies on discrete logic rather than matrix arithmetic, this algorithm translates directly into in-memory hardware. This opens a new route toward synthetic intelligence with human-level energy efficiency.
[AI-17] Geometry-Preserving Orthonormal Initialization for Low-Rank Adaptation in RLVR ICML2026
链接: https://arxiv.org/abs/2606.31813
作者: Ruijia Zhang,Jiacheng Zhu,Hanqing Zhu,Laixi Shi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 30 pages, accepted to ICML 2026
Abstract:Low-rank adaptation (LoRA) and its variants enable parameter-efficient fine-tuning of large language models under the supervised fine-tuning (SFT) paradigm. However, their efficacy and behavior under Reinforcement learning with verifiable rewards (RLVR) are less well understood. In particular, two structurally initialized LoRA variants, PiSSA and MiLoRA, which outperform standard LoRA under SFT, can underperform standard LoRA under RLVR and may even exhibit training instability. These observations suggest that how to initialize the low-rank matrices in RLVR remains unclear. In this work, we develop a theoretical analysis of LoRA in RLVR, showing that orthonormal initialization achieves the minimal gap between LoRA outcome and that of full fine-tuning. Guided by this insight, we propose geometry-preserving orthonormal initialization for low-rank adaptation in RLVR, leading to two new variants, RLPO and RLMO. Experiments on mathematical reasoning benchmarks show that the proposed orthonormal initialization stabilizes RLVR training and outperforms standard LoRA, contrasting with PiSSA and MiLoRA. Finally, our unified analysis for LoRA initialization also explains why PiSSA and MiLoRA can underperform in RLVR, which may be of independent interest. Code and checkpoints are publicly available at this https URL.
[AI-18] Large Databases Need Small Open-Weight Language Models
链接: https://arxiv.org/abs/2606.31808
作者: Parker Glenn,Alfy Samuel
类目: Artificial Intelligence (cs.AI); Databases (cs.DB)
备注:
Abstract:Language model systems built around proprietary APIs often operate on a token-based cost model. This becomes prohibitively expensive in the context of large databases, where LM-enhanced relational operators can incur costs exceeding 10,000 for a single set of experiments, hindering thorough research and practical deployment. In this paper, we demonstrate that quantized, open-weight models running locally on just 16GB of VRAM can match or exceed the accuracy of closed-source counterparts at lower latency and a fraction of the price, challenging the prevailing assumption that closed-source LM APIs are necessary for effective LM-database integration. We present and analyze the key system optimizations required to efficiently deploy these open-weight models within an LM-DB system. By integrating these local models into the BlendSQL v0.1.0 framework, we demonstrate a 390x reduction in overall costs and 3.8x reduction in latency compared to a proprietary LM API. We make our code available at this https URL.
[AI-19] RAISE: LLM -based Automated Heuristic Design with Robust Adversary Instance Search
链接: https://arxiv.org/abs/2606.31801
作者: Fei Liu,Alessio Figalli,Patrick Owen,Nicola Serra
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Automated Heuristic Design (AHD) with Large Language Models (LLMs) has shown remarkable progress in discovering high-quality heuristics. However, existing LLM-based AHD methods optimize heuristics for a fixed training instance set and may fail catastrophically when deployed under real-world distributional shifts. We propose Robust Adversary Instance Search (RAISE), a framework that integrates constrained worst-case instance search within a principled neighborhood of the training distribution into the LLM-based evolutionary search loop. RAISE treats robust AHD as a constrained adversarial instance search problem: the outer loop evolves heuristics via LLM operators, while an LLM-free inner loop efficiently identifies hard instances within an epsilon-ball around the training instance set using a basis distribution parameterization with boundary projection. Comprehensive experiments on Online Bin Packing (OBP), Online Job Shop Scheduling (OJSP), and Online Vehicle Routing (OVRP) across five distribution families demonstrate that existing LLM-based AHD methods degrade by up to 19 times under distribution shift, while RAISE consistently maintains strong performance across all tested distributions and problem scales
[AI-20] Evo-PI: Aligning Medical Reasoning via Evolving Principle-Guided Supervision
链接: https://arxiv.org/abs/2606.31800
作者: Xianda Zheng,Huan Gao,Meng-Fen Chiang,Michael Witbrock,Kaiqi Zhao,Shangyang Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite recent progress, the reasoning capabilities of large multimodal language models (MLLMs) remain fundamentally constrained by static supervision, where fixed prompts, rules, or reward models provide non-adaptive guidance throughout training. Such static signals are often sufficient to enforce output formats, but fail to shape the underlying reasoning process, leading to brittle generalization and performance saturation in complex decision-making tasks. We propose Evo-PI, a principle-centric learning framework that treats reasoning principles as explicit, language-based supervision signals that can be generated, evaluated, and iteratively evolved. Instead of relying on fixed rewards, Evo-PI enables a co-evolutionary loop in which principles guide model reasoning, while model behaviors in turn refine the principles that supervise them. This dynamic alignment mechanism allows supervision to progressively adapt to the model’s reasoning deficiencies. We instantiate Evo-PI in medical visual question answering as a high-stakes testbed requiring structured visual-textual reasoning. Across eight benchmarks and multiple model backbones, Evo-PI consistently improves reasoning accuracy, achieving gains of up to 24.6%. Our results suggest that evolving principle-guided supervision offers a scalable and general paradigm for training expert-aligned reasoning in MLLMs. Code is available at this https URL.
[AI-21] A Self-Evolving Agent ic System for Automated Generation and Execution of Biological Protocols
链接: https://arxiv.org/abs/2606.31763
作者: Yankai Jiang,Weiting Tang,Haoran Sun,Zhenyu Tang,Yuejie Hou,Yingnan Han,Rubo Wang,Yueyuxiao Yang,Cheng Liang,Lilong Wang,Wenjie Lou,Xiaosong Wang,Lei Bai,Meng Yang
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous wet-lab experimentation requires more than plausible protocol text: biological intent, quantitative procedures, device constraints and experimental feedback must remain aligned from protocol and SOP design to code and physical execution. We developed ProtoPilot, a self-evolving multi-agent system, together with an expert-grounded benchmark and evaluation framework for testing this conversion as an experimental automation problem. The framework spans 294 synthetic-biology and molecular-biology tasks derived from 98 gold-standard protocols, wet-lab expert rubrics, device-level validity gates and real experimental tests. ProtoPilot incorporates layer-wise verifiability, multi-agent orchestration and a runtime-updated skill library to generate protocols, expand SOPs, synthesize SDK-compliant code and revise workflows from wet-lab feedback. It achieved a Top@3 expert-preference rate of 90.2%, an overall protocol-to-code gate pass rate of 89.5% and an Opentrons pass rate of 88.24%, compared with 32.35% for OpenTrons-AI. Wet-lab validation produced interpretable readouts, Sanger-confirmed products and feedback-corrected PCA-assembled DNA targets, establishing a verifiable route to autonomous experimentation. Together, these results show that the evaluation framework captures execution-relevant requirements for autonomous wet-lab automation, and that ProtoPilot can meet them by converting protocol and code generation into validated execution and feedback-guided revision.
[AI-22] A Technical Typology of AI Systems in Public Administration
链接: https://arxiv.org/abs/2606.31755
作者: Jonathan Rystrøm,Chris Schmitz,Nathan Davies,Gerhard Hammerschmid,Albert Meijer,Chris Russell
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Research on artificial intelligence (AI) in the public sector often treats “AI” as a single category, neglecting technical distinctions between different AI systems. But these distinctions affect how different systems impact core public values like accountability, procedural justice, and non-discrimination. This paper argues that public administration research would benefit from more technical precision on “AI” and makes three contributions to this end. First, we introduce a typology of five categories of AI systems: hand-coded, glass-box, black-box, general-purpose, and agentic systems. We calibrate the typology to public administration by grouping system types by their distinct implications for public values. Second, we evaluate technical precision in recent public administration research about AI by coding 91 highly-cited papers (2019-2025) using our typology. We find widespread imprecision: most papers (55%) leave the studied system underspecified, 31% motivate their work with a different system than they study, and 41% make more general conclusions than the studied system supports. Finally, we give practical recommendations for future research. We highlight common pitfalls to avoid, and suggest that researchers should, at a minimum, provide enough technical detail to locate the studied system in our typology. To this end, we provide a practical guide – a short set of diagnostic questions answerable from public information and without specialist technical knowledge.
[AI-23] FedXDS: Leverag ing Model Attribution Methods to counteract Data Heterogeneity in Federated Learning
链接: https://arxiv.org/abs/2606.31742
作者: Maximilian Andreas Hoefler,Karsten Mueller,Wojciech Samek
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Explainable AI (XAI) methods have demonstrated significant success in recent years at identifying relevant features in input data that drive deep learning model decisions, enhancing interpretability for users. However, the potential of XAI beyond providing model transparency has remained largely unexplored in adjacent machine learning domains. In this paper, we show for the first time how XAI can be utilized in the context of federated learning. Specifically, while federated learning enables collaborative model training without raw data sharing, it suffers from performance degradation when client data distributions exhibit statistical heterogeneity. We introduce FedXDS (Federated Learning via XAI-guided Data Sharing), the first approach to utilize feature attribution techniques to identify precisely which data elements should be selectively shared between clients to mitigate heterogeneity. By employing propagation-based attribution, our method identifies task-relevant features through a single backward pass, enabling selective data sharing that aligns client contributions. To protect sensitive information, we incorporate metric privacy techniques that provide formal privacy guarantees while preserving utility. Experimental results demonstrate that our approach consistently achieves higher accuracy and faster convergence compared to existing methods across varying client numbers and heterogeneity settings. We provide theoretical privacy guarantees and empirically demonstrate robustness against both membership inference and feature inversion attacks. Code is available at this https URL.
[AI-24] Arena-T2I Hard: Benchmarking and Improving Faithfulness with Dependency-Aware Checklist
链接: https://arxiv.org/abs/2606.31711
作者: Yuanhao Ban,Tong Xie,Sohyun An,Yunqi Hong,Evan Frick,I-Hung Hsu,Wei-Lin Chiang,Ion Stoica,Cho-Jui Hsieh
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Faithfulness – how precisely a generated image aligns with its prompt – is increasingly central to the real-world utility of text-to-image (T2I) models. Existing faithfulness benchmarks, however, rely on simple atomic instructions, on which top-tier systems already achieve near-perfect scores. As T2I models enter creative workflows, users issue multi-faceted requests combining intricate spatial relationships, stylistic constraints, and complex text rendering. In this setting, a single binary VLM-judge score no longer captures which specific constraints the model fails to satisfy. We introduce Arena-T2I Hard, a 310-prompt stress benchmark drawn from real arena T2I logs, with approximately 30 decomposed yes/no constraints per prompt spanning six categories, including text rendering. The strongest closed-source system we evaluate reaches 0.855 with a 33~pp performance gap across 11 systems, demonstrating substantial discriminative power. Moreover, high public-arena rankings fail to predict faithfulness, confirming that holistic Bradley-Terry (BT) preference scores prioritize aesthetics over fine-grained prompt adherence. We propose a dependency-aware checklist reward that decomposes each prompt into a DAG of yes/no questions and zeroes descendants of failed parents, turning faithfulness into a per-constraint training signal. Combined with a BT aesthetic reward via group-decoupled normalization (GDPO), which standardizes each reward within its rollout group so neither collapses, the recipe attains a strictly better faithfulness-aesthetics trade-off on SD3.5-Medium and FLUX.1-dev under MMRB2 pairwise comparisons than every single-reward, naive weighted-sum, or 4-reward BT-ensemble baseline.
[AI-25] When to Truncate a Feature Ranking: A Residual-Overlap Stopping Rule for Subset Selection
链接: https://arxiv.org/abs/2606.31686
作者: Jesus S. Aguilar-Ruiz
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Feature rankings are widely used in supervised feature selection because they are simple, scalable and easy to interpret. Variables are first ranked by a relevance score, and a subset is then obtained by retaining the top-ranked variables. Although the first stage has been extensively studied, the second is often governed by an arbitrary cardinality, an empirical threshold or cross-validation, without a direct interpretation. This raises a basic question: given a feature ranking, when is there enough accumulated class-separation evidence to stop selecting features? This paper develops a distributional framework for transforming supervised feature rankings into class-independent subsets through an explicit risk-calibrated stopping rule. For each variable and each pair of classes, marginal separation is measured by the Bhattacharyya coefficient between the corresponding class-conditional distributions. The proposed method selects a single global subset shared by all classes by retaining the shortest prefix of a ranking whose residual product overlap falls below a prescribed threshold for every relevant class contrast. We derive binary and multiclass Bayes-risk bounds for the labelled product marginal problem, and obtain prior-dependent and prior-free calibrations of the residual-overlap threshold from a target all-pairs risk level. An empirical comparison on high-dimensional genomic datasets illustrates that the rule can reduce tens of thousands of variables to a few dozen while maintaining predictive performance statistically comparable to the all-features baseline. As the stopping rule only requires one-dimensional marginal overlap estimates and scans a precomputed ranking, it is well suited to very high-dimensional settings where exhaustive subset search is infeasible and interpretable truncation of feature rankings is essential. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.31686 [cs.LG] (or arXiv:2606.31686v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.31686 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-26] Improving Certified Robustness via Adversarial Distillation
链接: https://arxiv.org/abs/2606.31653
作者: Matteo Melis,Jesus Martinez Del Rincon,Vishal Sharma
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Certified training aims to produce models whose predictions can be formally verified against adversarial perturbations, typically by optimising upper bounds on the worst-case loss over an allowed perturbation set. For neural networks, certified training methods based purely on tight relaxation bounds produce networks that are amenable to certification, but sacrifice standard accuracy. Conversely, adversarial training often yields stronger empirical robustness and standard accuracy, but the resulting models are generally difficult to certify with neural network verifiers. Recently, the literature has shown that better standard-certified accuracy trade-offs can be achieved by combining adversarial training objectives with loose over-approximations based on Interval Bound Propagation (IBP), effectively interpolating between lower and upper bounds of the worst-case loss. Building on this, we introduce AD-CERT, a certified training objective that combines adversarial distillation with an IBP upper bound. We show that distilling adversarial information over the logit space from an empirically robust teacher provides an effective lower bound surrogate for certified training, with AD-CERT achieving state-of-the-art certified performance on several robustness benchmarks. Furthermore, in a unified setup, distilling adversarial information at the logit-level is shown to improve certified accuracy over a robust feature-space distillation objective by up to 5.40 percentage points.
[AI-27] FARS: A Fully Automated Research System Deployed at Scale
链接: https://arxiv.org/abs/2606.31651
作者: Qiong Tang,Xiangkun Hu,Xiangyang Liu,Yiran Chen,Yunfan Shao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent automated research systems show that language-model agents can generate hypotheses, run experiments, and write complete manuscripts, but most evidence still comes from selected examples, human-framed topics, or a few pre-defined research tasks. We present FARS (Fully Automated Research System), a fully automated AI-for-AI research system designed to operate across research topics at scale. FARS autonomously generates and advances projects through ideation, planning, experimentation, and writing, using stage-specific agents coordinated through a shared workspace that records proposals, code, logs, results, and manuscripts. In its first public deployment, FARS produced 166 complete research papers spanning 67 fine-grained AI/ML topics while preserving intermediate artifacts as an auditable corpus rather than a curated set of successes. We evaluate this corpus with 282 structured reviews from volunteer reviewers covering 140 papers, including overall ratings, sub-scores, integrity checks, and LLM-use disclosure. The reviews indicate that FARS can produce review-worthy and occasionally strong AI/ML research artifacts in a large-scale public deployment, while also exposing recurring failure modes in narrow experimental scope, methodological limitations, and integrity issues.
[AI-28] ECHO: Prune to act trace to learn with selective turn memory in agent ic RL
链接: https://arxiv.org/abs/2606.31650
作者: Zijun Xie,Binbin Zheng,Enlei Gong,Jihua Liu,Yuyang You,Lingfeng Liu,Jiayao Tang,Guanqun Zhao,Aoqi Hu,Zeyu Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Long-horizon language agents must repeatedly interact with tools, accumulate evidence, and make decisions under bounded context windows. Existing context-management methods make such rollouts feasible by truncating distant history, folding past turns into summaries, or selecting compact memory states. However, these breakthroughs introduce two coupled limitations. First, as the number of turns grows, historical observations are progressively removed or collapsed into compressed states, making it harder for the policy to reuse fine-grained evidence. Second, once the original turns are no longer source-addressable, outcome-based RL loses an explicit path for aligning policy updates with the evidence that supported a successful final answer. To this end, we propose ECHO, a selective turn-memory framework that jointly addresses history collapse and traceable learning through source-indexed reconstruction. Specifically, ECHO compresses each completed environment turn into a compact memory record, reconstructs bounded policy contexts by selecting from these records, and reuses the selected source indices to route positive outcome credit to the evidence and selection actions that support successful answers. On BrowseComp-Plus, ECHO reaches 43.4% held-out accuracy, outperforming GRPO (28.9%) and the rolling-summary baseline SUPO (36.1%), while using fewer turns and lower trajectory volume than SUPO (Figure 1). Additionally, the trained policy improves zero-shot generalization across multi-objective QA, code generation, and deep information-seeking benchmarks on both dense and MoE backbones.
[AI-29] hink in English Answer in Korean: Efficient Adaptation of Multilingual Tool-Using Agents
链接: https://arxiv.org/abs/2606.31648
作者: Utsav Garg,Sungjin Hong,Jason Jung,Justin Lee,Shaan Desai,Joon Hee Kim,Anirudh Shrinivason,Edmond Wen,Susie Park
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We present LuckyStar 111B, a 111B-parameter hybrid reasoning model developed through a collaboration between Cohere and LG CNS for Korean-English enterprise agents under practical memory and serving constraints. The model trains from Cohere’s fully post-trained Command A model rather than a new pretraining run, and uses preamble conditioning to switch between concise non-reasoning behavior and longer tool-oriented reasoning. We study four choices for scaling tool-using agents efficiently: multilingual supervised fine-tuning, reinforcement learning with verifiable rewards for multi-step tool-use tasks, language-consistency rewards for Korean user-facing responses, and 4-bit quantization for single-GPU serving. The adapted model improves mathematical reasoning, function calling, and agentic natural-language-to-SQL (NL2SQL) performance while preserving general Korean and English instruction-following quality. These results provide a practical recipe and failure-mode analysis for adapting post-trained multilingual models to verifiable agentic workflows under memory-constrained deployment.
[AI-30] A Lifecycle and Application-Stack Survey of Large Language Model Vulnerabilities: Attacks Risks Defenses and Open Problems
链接: https://arxiv.org/abs/2606.31639
作者: Seyed Bagher Hashemi Natanzi,Bo Tang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Logic in Computer Science (cs.LO)
备注:
Abstract:Large language models are no longer only text generators. They are increasingly embedded in retrieval pipelines, enterprise assistants, coding environments, robotic systems, security-operation workflows, and autonomous agents that can read private data, call tools, write files, execute code, and act across organizational boundaries. This shift changes the security problem: risks do not arise from the model weights alone, but from the full lifecycle and application stack through which data, prompts, model outputs, tools, memories, and user authority interact. This paper systematizes the literature on vulnerabilities in large language model systems through a lifecycle and application-stack lens. We organize attacks across eight stages: data collection, pretraining, post-training alignment, model packaging and supply chain, retrieval and memory, prompting and inference, tool/agent execution, and deployment/maintenance. For each stage, we analyze attacker capabilities, affected security objectives, representative attacks, practical risks, evaluation practices, and defenses. We further map LLM-specific vulnerabilities to confidentiality, integrity, availability, safety, privacy, fairness, accountability, and agency-control objectives. Unlike taxonomies that list isolated attack names, the proposed systematization emphasizes where trust boundaries fail, how untrusted data becomes executable instruction, how delegated authority amplifies model errors, and why point defenses rarely compose. We close with a research agenda for secure LLM systems, including compositional security, provenance-aware retrieval, tool-call containment, long-horizon agent evaluation, privacy-preserving adaptation, realistic red teaming, and deployment-grade incident response.
[AI-31] Intrinsic decomposition and editing of 3D Gaussian splats
链接: https://arxiv.org/abs/2606.31637
作者: Alexandre Lanvin,Jeffrey Hu,Simon Lucas,Adrien Bousseau,George Drettakis
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注: 18 pages
Abstract:Intrinsic decomposition which expresses image colors as the product of diffuse albedo and shading, possibly augmented with view-dependent residuals has a long history in image editing as it enables the modification of object colors and textures without altering lighting. We extend intrinsic decomposition to radiance fields represented with Gaussian splatting by proposing solutions to three key aspects of such decomposition. First, we describe how to model the intrinsic decomposition as independent sets of Gaussian primitives, which allows each set to adapt to the characteristics of the layer it represents. Second, we present an optimization procedure guided by data-driven predictions to disentangle multi-view photographs of a scene into the aforementioned intrinsic sets. Finally, we provide an editing workflow where users modify the texture of planar surfaces simply by modifying the albedo of that surface in one image. Capturing this edit within the intrinsic radiance field allows re-rendering of the edited scene with plausible lighting under arbitrary viewpoints.
[AI-32] Scientific Explanations in Health Sciences: Causality Trust and Epistemic Adequacy
链接: https://arxiv.org/abs/2606.31616
作者: Martina Mattioli,Marcello Pelillo
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Medical Artificial Intelligence (AI) is widely expected to transform clinical practice, yet the decision-making processes of many Machine Learning (ML) models remain opaque. Explainability has been advanced as a partial remedy to clarify why AI generates predictions, particularly in high-stakes contexts. Despite ongoing efforts, debates on what constitutes an adequate medical explanation remain unsettled. Yet, explanation has long been a central topic of inquiry in the philosophy of science and medicine. The insights developed in these fields, however, have been largely overlooked in contemporary explainable AI (XAI) research, leaving its foundational assumptions insufficiently examined. To address this gap, this paper develops a critical review at the intersection of philosophy of science and XAI. It examines prevailing accounts of what counts as an explanation in the health sciences and assesses their adequacy for informing XAI in medicine, arguing that they provide necessary conditions for a philosophically grounded approach to explainability in this domain. Building on this foundational philosophical literature, the discussion identifies three central axes of analysis: the role of causality in medical reasoning, the epistemic and relational dimensions of medical trust, and the criteria of explanatory adequacy as shaped by the pragmatic needs of diverse stakeholders. By integrating philosophical analysis with current developments in medical AI, the paper outlines principles for designing XAI systems that offer explanations that are not only epistemically robust but also aligned with the epistemic and practical requirements of clinical decision-making, shaping ongoing debates in medical XAI toward underexplored conceptual foundations.
[AI-33] Automating Cause-Effect Specification with Knowledge Graphs and Large Language Models
链接: https://arxiv.org/abs/2606.31614
作者: Javal Vyas,Milapji Singh Gill,Mehmet Mercangöz
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI)
备注:
Abstract:Engineering specifications such as interlocks, alarm rationalization tables, and cause-and-effect (CE) matrices remain central to process control and safety, yet their creation is still predominantly manual, document-driven, and prone to inconsistency. This paper presents a semantic-AI framework that automates the generation of CE logic by combining a knowledge graph (KG) with a constrained large language model (LLM) layer. The KG builds on an established modular alignment ontology to represent process structure, operating modes, faults, symptoms, causes, and mitigation actions in a machine-interpretable form. The LLM then transforms this information into operator-ready safety narratives and Semantic Web Rule Language (SWRL) rules under strict ontology and vocabulary constraints, grounding the generated artifacts in the underlying semantic model. The workflow is demonstrated on a modular process plant, showing how engineering semantics, diagnostic relations, and machine-verifiable specifications can be generated from a unified knowledge representation with reduced manual effort.
[AI-34] Comparative Analysis of Machine Learning based Intrusion Detection in Realistic IoT Networks
链接: https://arxiv.org/abs/2606.31594
作者: Rana Alharbi,Chuadhry Mujeeb Ahmed
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The Internet of Things (IoT) is rapidly growing and expanding into various sectors, such as healthcare, transportation, smart homes, and more. Despite the benefits of using IoT devices, they present several challenges. Given the significant role these devices play in our lives, it is crucial to address issues related to their security and privacy. These devices are limited in resources, which complicates their security and the protection of the data that they manage. The paper aims to examine intrusion detection systems using the Gotham2025 dataset, generated through the Gotham testbed, which consists of 78 emulated IoT devices utilising various protocols, including MQTT, CoAP, and RTSP, to assist in safeguarding IoT networks from attacks. We conduct a comparative analysis between five machine learning algorithms, including Random Forest, XGBoost, Logistic Regression, Naive Bayes, and Deep Neural Network. We demonstrate that the Random Forest Classifier was the top-performing model, achieving an F1-score of 0.99 in classifying attacks.
[AI-35] Evil Spectra: How Optimisers can Amplify or Suppress Emergent Misalignment
链接: https://arxiv.org/abs/2606.31591
作者: Jason R. Brown,Patrick Leask,Lev McKinney
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Emergent misalignment (EM) is a recently discovered phenomenon in LLMs where fine-tuning on a narrow misaligned task, such as writing insecure code, leads to broadly misaligned behaviour on unrelated prompts. Previous work has noted that the severity of EM is highly sensitive to training choices; however, we still lack a systematic characterisation of this sensitivity. We perform a sweep over several Qwen3 models, optimisers, datasets, and batch sizes, and find that the choice of optimiser has the largest effect, producing a 7x spread in misalignment rate. Surprisingly, model size has a negligible effect within the Qwen3 family. An additional sweep over 12 models from three families using Adam confirms that model scale (1B-235B) and family have negligible effects for that optimiser. Analysing the loss-alignment relationship on Qwen3-8B, we find that final log training loss is a strong predictor of alignment, and that stratifying by optimiser captures nearly all the residual variance. Training dynamics reveal that each optimiser follows a different trajectory through loss-alignment space, and that after significant training, the optimiser becomes more important than training loss as a predictor of alignment. Muon, the adaptive optimiser that preserves alignment the best, implicitly regularises for a more uniform distribution of singular values of the LoRA adapter. We evaluate this insight by training with an additional loss term that incentivises a flatter singular value spectrum, and find that this substantially recovers alignment for the more EM-prone adaptive optimisers (Adam and Lion), with negligible cost to training loss. These results identify optimiser choice as a key factor in EM severity, but show that spectral regularisation can substantially mitigate the effects of EM-prone optimisers.
[AI-36] ZEBRA: Zero-Shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization in Audio-Language Models INTERSPEECH2026
链接: https://arxiv.org/abs/2606.31587
作者: Asif Hanif,Mohammad Yaqub
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Accepted in InterSpeech 2026
Abstract:Audio-Language Models (ALMs) achieve strong zero-shot performance by aligning audio with textual class descriptions. Although prompt learning improves accuracy on base classes through few-shot supervised adaptation, we observe a critical trade-off: it often degrades performance on novel classes, sometimes falling below zero-shot accuracy. This exposes a base-to-novel generalization gap in prompt learning for ALMs. To address this issue, we propose \textbfZEBRA (Zero-shot Entropy-Regularized Prompt Learning for Base-to-Novel Generalization), a plug-and-play framework that fuses zero-shot logits with prompt-learning logits, and employs self-entropy regularization to reduce overfitting to base classes. Experiments across multiple audio classification datasets show that ZEBRA consistently improves novel-class performance while maintaining strong base accuracy, significantly reducing the base-to-novel gap compared to standard prompt learning. The code is available at: this https URL.
[AI-37] Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index
链接: https://arxiv.org/abs/2606.31575
作者: Outongyi Lv,Yanzhao Zheng,Yuanwei Zhang,Zhenghao Huang,Xingjun Wang,Baohua Dong,Hangcheng Zhu,Yingda Chen
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 4 figures
Abstract:Reinforcement learning (RL) has become a powerful tool for propelling Large Language Models (LLMs) beyond imitation-based training towards more robust reasoning capabilities. Among existing approaches, RL with Verifiable Rewards (RLVR) has emerged as a pivotal paradigm for advancing LLM reasoning. Despite its empirical success, recent studies have offered different insights. One line of inquiry advocates prioritizing high-entropy token positions during training, while another perspective cautions against allowing low-probability tokens to dominate gradient updates. Notably, although high-entropy tokens are usually correlated with low probability, both paradigms empirically yield substantial performance gains. In this work, we argue that evaluating sampled-token probability or entropy in isolation is insufficient to capture the policy optimization dynamics. To resolve this tension, we introduce the Relative Surprisal Index (RSI), a principled, information-theoretic metric that naturally couples the token’s entropy with the probability of the selected token. We show that, under mild conditions, RSI is related to the local ratio between the first-order variations of the logit-gradient norm and predictive entropy under a selected-logit perturbation. Building on RSI, we propose RSI Selection (RSI-S), an entropy-adaptive token filtering method that retains tokens within a stable RSI interval. RSI-S successfully reconciles previous contradictory paradigms and filters out both redundant low-surprisal tokens and unstable high-surprisal tail tokens. Empirical evaluations show that RSI-S achieves higher avg@32 accuracy across different model scales (Qwen2.5-1.5B, 3B, and 7B) on AIME and AMC benchmarks: RSI-S improves avg@32 accuracy by 2–3 percentage points over GRPO. Overall, RSI offers a promising perspective for RLVR improvement.
[AI-38] FLARE-AI: Flaw Reporting for AI ICML2026
链接: https://arxiv.org/abs/2606.31567
作者: Shayne Longpre,Elaine Zhu,Carson Ezell,Avijit Ghosh,Sean McGregor,Kevin Paeth,Kevin Klyman,Sayash Kapoor,Rishi Bommasani,Ruth Appel,Gregory Strom,Lauren McIlvenny,Mark M. Jaycox,Peter Slattery,Nathan Butters,Arvind Narayanan,Percy Liang,Alex Pentland
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted to ICML 2026
Abstract:Flaw reporting for deployed AI systems is fundamental to identifying system failures and improving AI safety. Yet the AI reporting ecosystem is fragmented: researchers who identify flaws often do not know what or where to report, and groups who receive reports rarely share them with other relevant stakeholders. As a result, good-faith reporters duplicate effort by submitting many different forms, and recipients lack standardized, triage-ready information. We audit 12 reporting systems published by AI developers, cybersecurity groups, and AI flaw aggregators, identifying five recurring design challenges spanning discoverability, scope, information collection, coordination, and guidance for strict-liability cases. Building on this analysis and feedback from 49 experts across 32 organizations representing developers, security researchers, and ecosystem coordinators, we introduce FLARE-AI, an open-source AI flaw reporting system designed for interoperability with existing systems. FLARE-AI streamlines flaw report creation by collecting triage-relevant information through conditional logic and early classification, then enables optional dissemination of standardized, machine-readable reports to multiple developers, coordinators, and incident registries from a single submission. By lowering barriers to reporting AI flaws and improving interoperability across stakeholders, FLARE-AI helps break down silos and accelerate remediation across the AI ecosystem.
[AI-39] ACE: Pluggable Adaptive Context Elasticizer across Agents
链接: https://arxiv.org/abs/2606.31564
作者: Ning Liao,Zihao Long,Xiaoxing Wang,Xue Yang,Yaoming Wang,Ziyuan Zhuang,Xunliang Cai,Rongxiang Weng,Junchi Yan
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing complexity of agentic tasks has led to rapidly growing trajectory lengths, which poses significant challenges for large language model (LLM) based agents with fixed context windows. Existing context management techniques, such as truncation and summarization, suffer from inherent inflexibility and irreversibility: once information is discarded or compressed, it cannot be recovered even when it becomes critically relevant in later decision steps. To address these limitations, we propose the Adaptive Context Elasticizer (ACE), a plug-and-play module that elastically orchestrates historical step information into the agent’s context at each decision step. ACE maintains a lossless message maintenance layer that stores both raw messages and compressed abstractions for each historical step, while a context orchestration layer adaptively assigns each step an elastic type as raw, abstract, or drop, at every decision step based on the current task state. This reversible design ensures that the main LLM always receives a compact yet information-rich context. We adapt ACE to four diverse agent frameworks, including ReAct, DeepAgent, WebThinker, and MiroFlow, without training or architectural modifications. Experiments show that ACE consistently outperforms truncation and summarization baselines, and brings consistent performance gains across all four agent frameworks.
[AI-40] CVE-TTP KG: Knowledge Graph Linking Software Vulnerabilities to Attack Behaviors
链接: https://arxiv.org/abs/2606.31557
作者: Basant Agarwal,Dincy R. Arikkat,Swati Yadav,Serena Nicolazzo,Antonino Nocera,Vinod P
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:In the evolving threat landscape, adversaries exploit software vulnerabilities to launch sophisticated attacks, challenging traditional defenses. Although databases like CVE and NVD provide detailed technical information, they often lack links to attacker behaviors such as tactics and techniques, limiting effective threat interpretation and response. This work bridges this gap by connecting vulnerabilities with behavioral patterns from the MITRE ATTCK framework. We construct a CVE-TTP Knowledge Graph that links CVEs to tactics and techniques using classification and relation extraction. Transformer-based models are developed for behavior identification, with CySecBERT achieving macro F1-scores of 87.71% (techniques) and 96.16% (tactics). Also, we created an annotated dataset with 24,820 entities and 43,608 relations for entity and relation extraction. The pipeline-based approach achieves macro F1-scores of 0.86 (entity extraction) and 0.99 (relation extraction), while a span-based joint model achieves 0.78. These outputs are integrated into a Neo4j-based Cyber Threat Knowledge Graph, enabling structured visualization of vulnerabilities.
[AI-41] A time-series classification framework for individual-level absenteeism prediction under severe class imbalance
链接: https://arxiv.org/abs/2606.31532
作者: Kwong Ho Li,Matthew Roughan,Wathsala Karunarathne
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Staff absenteeism imposes substantial operational costs in high-demand work environments such as healthcare, emergency services, meat processing, construction, and courier and delivery services, where proactive workforce planning depends on reliable individual-level absence prediction. Existing regression and classification approaches share a structural limitation; they map features observed at time t to labels at the same time t, reproducing already-realised outcomes rather than predicting future events, and discard the sequential behavioural structure inherent in individual attendance histories. We propose a Time Series Classification (TSC) framework that separates historical attendance sequences from future absence labels, enabling genuinely proactive prediction. Due to the lack of public longitudinal attendance data, we construct a reproducible simulated dataset calibrated to the UCI dataset. We analyse Binary Focal Loss (BFL) and Geometric Mean (G-Mean) loss under severe class imbalance using only the imbalance ratio \rho . For BFL, the initial gradient ratio is \rho\alpha/(1-\alpha) , implying the balanced weight \alpha = 1/(1+\rho) \approx 0.023 . Experiments show that performance is governed mainly by \alpha , with BFL achieving specificity 0.813 and balanced accuracy 0.888, comparable to G-Mean. Unlike BFL, G-Mean adapts automatically without parameter calibration. Among three deep learning architectures evaluated, Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and the hybrid LSTM-Fully Convolutional Network (LSTM-FCN), the LSTM-FCN delivers strong precision and specificity. Stable performance is obtained with batch sizes = 64 and window sizes between 40-80 days, yielding balanced accuracy of approximately 80% on held-out test data.
[AI-42] On the Convergence of Self-Improving Online LLM Alignment UAI2026
链接: https://arxiv.org/abs/2606.31524
作者: Xudong Wu,Pangpang Liu,Vaneet Aggarwal,Jiayu Chen
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted at UAI 2026
Abstract:The Self-Improving Alignment (SAIL) algorithm addresses distribution shift by reducing a bilevel formulation of the problem to an efficient, single-level method. Empirically, SAIL has demonstrated strong performance on this task. However, a formal analysis of its convergence properties has been lacking. We identify a key theoretical challenge: the standard SAIL objective function is not guaranteed to be strongly concave due to unfavorable properties of its Hessian. To address this limitation, we propose a regularized objective, SAIL-RevKL, which incorporates a reverse Kullback-Leibler (KL) divergence penalty to improve the optimization landscape. Our central theoretical contribution is to prove that this regularized objective satisfies the Polyak-Lojasiewicz (PL) condition within a bounded parameter space. We establish global convergence guarantees, achieving a near-linear sample complexity. We further validate the effectiveness and stability of SAIL-RevKL through empirical evaluations, demonstrating that it outperforms the vanilla SAIL on both MuJoCo benchmarks and LLM alignment tasks.
[AI-43] Design and Implementation of Agent ic Orchestrations and Orchestration of Agents
链接: https://arxiv.org/abs/2606.31518
作者: Stefanie Rinderle-Ma,Juergen Mangler,Johannes Loebbecke,Dominik Voigt,Nataliia Klievtsova,Matthias Ehrendorfer
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agentic Business Process Management has gained momentum recently. The prospect is that the autonomy of AI agents, i.e., predominantly LLM-based agents, can be balanced with a certain level of robustness, tractability, and traceability through a combination with process technology. In this paper, we provide a classification framework for agentic orchestration options along properties such as task specificity, traceability and tractability, autonomy and reactivity, and correctness assurance and present qualitative decision criteria for realizations of different scenarios. We also provide metrics for the quantitative assessment of realization properties and show them through different agentic implementations of a predictive light sensing scenario. Altogether, this work aims at providing properties, criteria, and metrics for the design and implementation of agentic orchestrations and orchestration of agents.
[AI-44] Surprise as a Signal for Plasticity and Metacognition
链接: https://arxiv.org/abs/2606.31495
作者: Louis Mouchon
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We study a single idea across two settings: that a prediction-error signal, computed by a small predictor over the latent space of a frozen encoder, can serve both as a gate on plasticity and as a substrate for metacognition. In the first system, a non-parametric episodic memory writes a new concept only when this surprise is high, and a periodic offline replay phase consolidates recent traces into a slow linear readout. On a continual stream of 1000 ImageNet classes with a frozen DINOv2 or I-JEPA backbone, the consolidation phase recovers 17.7 points of retention on the oldest classes for DINOv2 and 51.3 points for I-JEPA (single-seed runs), and an ablation shows that replaying only a recent window is worse than no replay at all. In few-shot evaluation the same memory reaches 91.6% on 5-way 1-shot mini-ImageNet, above a task-specific baseline, while a harder 500-way regime exposes the true difficulty. In the second system, the same surprise signal, computed in a shared text-image space, modulates the behaviour of a vision-language model: it answers assertively when a concept is known, hedges when it is partially familiar, and refuses to identify the object and asks for an explanation when it is novel, learning the concept from a single user utterance. The external detector separates known from novel concepts at an AUROC of 0.966 (95% CI +/-0.024), far above the model’s own verbalised confidence (0.618), while its token-level confidence sits below chance under greedy decoding; after a sleep phase that empties the fast store, the system recalls 99.2% of fifty taught facts from the consolidated store while a base model recovers none. We report both systems as proof-of-concept, with explicit limitations, and position the second against recent episodic-memory and personalised-VLM work.
[AI-45] Robustness of Robotic Manipulation: Foundations and Frontiers
链接: https://arxiv.org/abs/2606.31494
作者: Yifei Dong,Zhanyi Sun,Lujie Yang,Manuel Baum,Kei Ikemura,Shuran Song,Florian T. Pokorny,Xianyi Cheng
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Humans and animals exhibit remarkable robustness in physical manipulation, yet robots remain far behind. Progress toward human-level manipulation robustness is hindered by the absence of a unified and systematic understanding: different subfields frame robustness in distinct ways, often leaving the concept ambiguous and limiting deeper analysis as well as communication across research areas. This paper presents a systematic study of manipulation robustness. We begin with a formal definition, characterizing robustness as the degree to which a manipulation system can achieve its goal in the presence of uncertainty and variation. Building on this definition, we introduce general formulations of manipulation robustness from probabilistic and control-theoretic perspectives. We then synthesize the guiding principles and concrete mechanisms of manipulation robustness across perception, planning, control, policy learning, and hardware, illustrating each mechanism through representative works, including foundational and recent studies. In addition, we revisit existing metrics and evaluation methods for quantifying manipulation robustness. Finally, we distill broader lessons for designing robust manipulation systems and discuss open problems and future directions toward achieving human-level robustness in robotic manipulation.
[AI-46] CLOUDADV: Decision-Aligned Instance Sizing with Zero-Shot Foundation Models under Drift
链接: https://arxiv.org/abs/2606.31470
作者: Jack Bell,Giacomo Carfi,Gerlando Gramaglia,Andrea Simioni,Daniele Fontani,Vincenzo Lomonaco
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 2 figures
Abstract:Cloud virtual machines are often overprovisioned, creating avoidable cost and operational inefficiency. We present CLOUDADV, an interactive engineer-facing advisory system for cloud instance sizing under workload drift. The system combines zero-shot time-series forecasting with bounded recommendation generation across day-, week-, and month-scale planning horizons. For each query, CLOUDADV constructs a structured decision context from historical utilization, forecast summaries, current VM metadata, candidate instance options, pricing, and explicit sizing heuristics. A higher-capacity LLM is used offline to generate reference recommendations, while a smaller production model is evaluated on the same prompts to assess deployment-time alignment under latency and cost constraints. Evaluation prioritizes downstream recommendation quality using simulated Azure cost savings and ex-post exceedance, with rolling-origin forecast accuracy reported as a secondary diagnostic against classical and supervised baselines. In a case study of seven production VMs, the reference recommendations reduce simulated monthly cost from about \ 1,503 to \ 708, yielding \ 795/month in savings (52.9%) under conservative heuristic constraints, while the highest observed exceedance rate among downgraded cases is 1.5%. Although Chronos-2 does not minimize every forecasting metric, it often induces recommendation patterns similar to those of a supervised per-VM baseline. These results suggest that zero-shot foundation models can support decision-aligned provisioning in non-stationary cloud environments while reducing the operational burden of repeated per-tenant retraining, revalidation, and redeployment.
[AI-47] CSTrader: A Testbed for Language-Grounded Trading in a Community-Driven Virtual Asset Market
链接: https://arxiv.org/abs/2606.31461
作者: Yao Shi,Kingfung Luo,Nan Tang,Yuyu Luo
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:Niche asset markets, such as Counter-Strike 2 (CS2) weapon skins, are small, volatile, and heavily driven by community discussions and platform rules. These properties make them hard for traditional quantitative models, but provide an ideal testbed for studying how large language models (LLMs) turn unstructured text into trading actions. We present CSTrader, a multi-agent framework for language-grounded trading in the CS2 skin market. The system first integrates heterogeneous signals from various sources, then uses specialized agents for technical analysis, liquidity, events, and (reversed) sentiment, and finally applies risk control, transaction friction, and portfolio management agents to produce buy, sell, or hold decisions under realistic trading frictions. We build a live-like evaluation environment with real CS2 data from a highly volatile period and evaluate several recent LLM backbones. Across models, CSTrader consistently outperforms both a falling market index (-15.62%) and simple single-prompt LLM baselines, achieving up to a 7.58% cumulative return with controlled risk. Ablation studies show that liquidity, reversed sentiment, and transaction friction agents are crucial for turning noisy language signals into stable profits, suggesting that niche, language-driven markets are a useful benchmark for future language-to-action research. Code is available at: this https URL
[AI-48] UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation ECCV2026
链接: https://arxiv.org/abs/2606.31451
作者: Jiahang Tu,Fengyu Yang,Chenyang Ma,Xihang Yu,Ziyao Zeng,Shaokai Wu,Hanbin Zhao,Zhi Tao,Chao Zhang,Hui Qian,Alex Wong
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: This paper has been accepted by ECCV 2026
Abstract:Unified multimodal models (UMMs) have shown great promise in integrating understanding and generation across diverse modalities. However, existing research rarely extends this paradigm to the tactile domain, where both object-level semantics and sensor-level configurations jointly determine the meaning of touch. To address this gap, we propose UniTac, the first UMM designed for tactile understanding and generation. UniTac models the tactile process as a transition from non-contact to contact, capturing the physical interaction between sensors and objects through a dual-level representation that encodes both sensor and object attributes. For tactile understanding, UniTac introduces two tasks, object property description and sensor identification, to enhance reasoning over physical and cross-sensor information. For tactile generation, we design a two-stage training paradigm consisting of reconstruction and alignment, together with a sensor-prior-based sampling strategy that simulates realistic tactile contact. Trained on large-scale multi-sensor datasets, UniTac achieves state-of-the-art performance in tactile understanding and generates realistic tactile signals across sensors.
[AI-49] Who Determines the Meaning of an Emotion? Affective Sovereignty as an Epistemic Consequence of Measurement Limits
链接: https://arxiv.org/abs/2606.31442
作者: Keito Inoshita
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emotion-sensing AI is rapidly becoming embedded in vehicles, home appliances, dialogue agents, and social infrastructure, giving rise to a sphere in which emotion is no longer confined to individual experience but is instead observed and computed at a societal scale, a domain we term the Affectosphere. Yet a central normative question in this domain has remained underexplored: who has the final authority to determine the meaning of one’s own emotion? This study addresses the question from the epistemological side of measurement’s structural limits. We define a meaning distribution as the distribution of labels assigned by annotators drawn from a population under a fixed annotation protocol, and decompose its uncertainty into reducible and irreducible components. We then demonstrate that, while emotion AI can assign high-confidence point labels and discriminate real differences at an aggregate level, the irreducible component of the meaning distribution for individual instances cannot be estimated with adequate coverage under realistic annotator counts, a systematic divergence we term the epistemic gap. The key finding is that high device confidence does not constitute evidence that irrecoverable meaning has been recovered. From this epistemic gap, together with an explicitly stated normative premise, namely that the output of a system which cannot recover a quantity in principle must not be treated as its authoritative determination, we derive the norm that the final interpretive authority over the meaning of one’s emotion is procedurally reserved for the experiencing subject, the norm of affective sovereignty. These results suggest that the design, evaluation, and regulation of emotion AI should place explicit allocation of interpretive authority, rather than accuracy maximisation, at their core.
[AI-50] DA-Studio: An Agent ic System for End-to-End Data Analysis VLDB2026
链接: https://arxiv.org/abs/2606.31423
作者: Yizhe Liu,Shaolei Zhang,Ju Fan
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注: VLDB 2026 Demo submission
Abstract:Real-world data analysis is a multi-step process over heterogeneous inputs rather than merely producing a final answer. A practical system should autonomously organize multi-step workflows, execute generated code in a sandboxed and controllable environment, and remain inspectable through visible action traces and intermediate artifacts. Existing LLM-based analysis tools, however, often emphasize isolated subtasks, leaving limited support for complete execution-grounded workflows. We present DA-Studio (Data Analysis Studio), an interactive web-based demo system for end-to-end data analysis that is autonomous, sandboxed, and inspectable. DA-Studio integrates an action-structured analysis backend, a sandboxed execution workspace, and a browser interface for task setup, streamed action traces, artifact preview, code editing and rerunning, and report export. Through iterative action generation, code execution, and feedback incorporation, it incrementally constructs executable analysis steps from raw files and natural-language requests while exposing intermediate results and artifacts throughout the process. Comments: VLDB 2026 Demo submission Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2606.31423 [cs.DB] (or arXiv:2606.31423v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2606.31423 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-51] Ask the World Before Acting: Budgeted Environment Probing for World-Model Calibration
链接: https://arxiv.org/abs/2606.31422
作者: Xinyuan Song,Zekun Cai
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Long-horizon language agents do not only choose actions; they carry a private model of the world from one decision to the next. When that model drifts, a later failure can be decided before the failing action is ever taken. We study a direct repair mechanism: before committing to the next task action, an agent may ask the environment about one belief field and write the answer back into its world model. This makes environment interaction a scarce calibration resource, not merely a way to advance the task. We introduce \method, a budgeted probing operator for structured belief tables. The useful probes are not the same everywhere. Procedural beliefs, such as tool dependencies, can often be repaired by targeted checks, but those checks spend steps that the task may need. Spatial beliefs, such as object locations and graph edges, rely more on structural cues; the agent’s own confidence can be a poor guide when the world changes off-screen. A type-stratified analysis formalizes this probe-action frontier, and controlled experiments show that mid-planning environment evidence reduces terminal world-model error when the probe policy follows the structure of the task.
[AI-52] BP-TTA: Balanced and Prototype-Guided Test-Time Adaptation in Dynamic Scenarios
链接: https://arxiv.org/abs/2606.31420
作者: Shaoyang Huang,Yashi Zhu,Yichen Yu,Lei Zhang,Zhang Yi,Tao He
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Test-Time Adaptation (TTA) enables models trained on a source domain to adapt online to unlabeled test data under distribution shifts. While recent TTA methods have moved beyond static settings and begun to consider continual domain shifts, they primarily address distribution drift and fail to account for class imbalance in dynamic scenarios. In real-world test-time streams, class imbalance and continual domain shifts often occur at the same time and interact with each other. In this paper, we propose a novel Balanced and Prototype-Guided Test-Time Adaptation (BP-TTA) method, which combines batch-balanced sampling with prototype-guided adaptation to handle the class imbalance and continual domain shift problems. BP-TTA constructs balanced adaptation batches by integrating current samples with high-confidence historical instances, effectively mitigating bias toward dominant classes and stabilizing online updates. Meanwhile, BP-TTA maintains evolving class prototypes during inference and leverages prototype similarity as a constraint for model adaptation, thereby improving the reliability of pseudo-labels and enhancing the stability of online updates under persistent domain shifts. Extensive experiments demonstrate that BP-TTA consistently outperforms state-of-the-art TTA methods in dynamic test-time streaming settings.
[AI-53] Learning to Select Not Relearn: Hard-Routed Mixtures of Reasoning LoRAs
链接: https://arxiv.org/abs/2606.31413
作者: Seyed Alireza Molavi,Zhan Su,Yan Hu,Peyman Sheikholharam Mashhadi,Stefan Byttner,Prayag Tiwari
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Code available at: this https URL
Abstract:Composing independently trained LoRA adapters into a single large language model is useful for multi-domain adaptation, especially when the original training data cannot be shared. A common approach is to use MoE-style routing over LoRA experts, but for frozen pretrained adapters, soft weighted combinations can change the unit-scale additive update under which each LoRA module was originally trained. We propose \textbfHard-Routed MoR-LoRA, a two-stage framework for composing frozen reasoning LoRA experts through unit-scale hard selection. First, domain-specific LoRA adapters are trained independently using reinforcement learning from verifiable feedback to obtain reasoning experts. Then, all experts are frozen, reasoning traces are distilled from them, and only a lightweight shared router together with a small attention LoRA is trained for integration. The router selects exactly one expert per token using hard top-1 routing, while a straight-through estimator enables gradient-based training. Experiments across five benchmarks, multiple model scales, and additional model families show that Hard-Routed MoR-LoRA preserves expert behavior while requiring substantially fewer trainable parameters than soft-routing mixture baselines. Our analysis further shows that normalized soft mixtures often concentrate most routing mass on a single expert, suggesting that hard unit-scale routing provides a simple and efficient abstraction for frozen LoRA expert composition.
[AI-54] Xiaomi-GUI-0 Technical Report
链接: https://arxiv.org/abs/2606.31410
作者: Wanxia Cao,Chengzhen Duan,Pei Fu,Pengzhi Gao,Niu Lian,Fazhan Liu,Hui Liu,Heng Qu,Qinzhuo Wu,Zhehao Yu,Tongbo Chen,Shiqi Cui,Anan Du,Shukai Jia,Yuanfa Li,Yike Liu,Wenchao Lu,Haoyuan Sun,Jiatong Sun,Cheng Tan,Yajie Wang,Changqiao Wu,Tao Xiong,Jiahui Yang,Yuxuan Yuan,Ruoceng Zhang,Shaojie Zhang,Jian Zhu,Jian Luan,Cong Zou
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Graphical user interface (GUI) agents build on vision-language models to complete user tasks end-to-end in real applications through interface actions such as tapping, swiping, text entry, and navigation. However, existing GUI agents are trained and evaluated largely on offline trajectories, simulated environments, and standardized benchmarks. These differ substantially from real applications in interface layout, interaction logic, and abnormal-state distribution, and cannot faithfully characterize execution stability in real-world use, where account states, permission dialogs, payment authentication, and risk control continually reshape the state distribution and open a persistent gap between benchmark scores and real usability. To close this gap, we propose Xiaomi-GUI-0, a native multimodal GUI agent for real mobile environments, trained and evaluated within a real-device closed loop. At its core is a real-device-dominant hybrid infrastructure, where physical devices are the primary execution environment and sandboxes provide auxiliary support, so that data collection, training, rollout, and evaluation share an execution distribution close to real deployment. We construct multi-source training data spanning high-frequency head tasks, high-generalization data for long-tail intents, and capability-enhancement data for reflection and memory, and introduce an error-driven data flywheel that turns failure trajectories into corrected actions, reflective explanations, and recovery demonstrations. The model is trained through a progressive three-stage pipeline of supervised fine-tuning, step-level reinforcement learning, and agentic reinforcement learning. Evaluated on public benchmarks and our in-house RealMobile, Xiaomi-GUI-0 achieves 72.0% success on RealMobile and 78.9% on AndroidWorld, while substantially improving execution stability and abnormal-state recognition in real-world tasks.
[AI-55] Wisdom Of The (AI) Crowd: Investigating Artificial Swarm Intelligence In Large Language Models
链接: https://arxiv.org/abs/2606.31404
作者: Justin Brenne,Christian Meske
类目: Artificial Intelligence (cs.AI)
备注: 18 pages, 0 figures, 6 tables, Accepted at ECIS 2026 (European Conference on Information Systems), Track: General Track, Paper No. ECIS2026-1499
Abstract:Human swarm intelligence demonstrates remarkable collective accuracy but faces scalability constraints in cost, coordination, and time. We investigate whether large language models (LLMs) can approximate swarm intelligence effects through artificial swarms, addressing a critical gap in understanding AI-based aggregation mechanisms. We conducted a controlled experiment with 960 manually executed prompts across three proprietary models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5), testing intra-model sampling and inter-model aggregation on eight estimation tasks. Results reveal consistent error reduction through intra- and inter-model aggregation, with significant error reductions up to 37 percentage points in MAPE across different aggregation strategies. We observed small to large effect sizes for positive correlations (Spearman’s \rho=0.242-0.568 , all p0.001 ) between relative confidence interval widths and relative estimation errors, suggesting LLMs possess metacognitive awareness when assessing uncertainty. We discuss implications for research and practice, providing actionable insights for deploying LLM swarms in organizational decision-making.
[AI-56] World-Model Collapse as a Phase Transition
链接: https://arxiv.org/abs/2606.31399
作者: Xinyuan Song,Zekun Cai
类目: Artificial Intelligence (cs.AI)
备注: Under Review
Abstract:Water looks unchanged as it warms, then at a critical point it boils. We ask whether long-horizon language agents show an analogous transition in their implicit world models. In some parameter settings, changing state load by a small amount, or adding a single step of horizon, leaves behavior nearly unchanged; near a critical boundary, the same small change causes a sudden world collapse. We study this effect in a deterministic task family with exact per-step gold state. A large grid search over state cardinality, dependency density, horizon, branching, observation mode, and mutation rate reveals a phase diagram: a solved plateau, a narrow transition band, and a collapse floor. Per-step traces show the mechanism: world-state fidelity fails before action validity, so the agent is not merely choosing a bad action; it is acting from a corrupted world. Stronger models translate the critical boundary but do not remove the qualitative transition. These results make world-model collapse a measurable bottleneck for long-horizon agents.
[AI-57] Mixture-of-Control: State-Aware Fine-Tuning for Transformer-based Models ICML2026
链接: https://arxiv.org/abs/2606.31397
作者: Duc Anh Nguyen,Tien Ngoc Luu,Tung Pham,Toan Tran
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICML 2026 Workshop on Connecting Low-rank Representations in AI, CoLoRAI, 26 pages, 12 figures, 5 tables
Abstract:State-based fine-tuning has emerged as a compelling alternative to weight-based adaptation for transformers, updating lightweight controls into states rather than model weights, offering substantial memory savings while retaining parameter efficiency. However, most existing state-based methods typically apply only per-block control updates, which limits inter-block information exchange and restricts representational adaptation. Meanwhile, prior mechanisms that enable cross-block communication often introduce considerable computational overhead, reducing their practicality for efficient fine-tuning. We introduce Mixture-of-Control (MoC), a lightweight fine-tuning framework that adaptively integrates local and global control signals to enhance representation learning. MoC treats block-wise control states as experts in a sparse mixture-of-experts process, enabling efficient communication across transformer blocks. Empirical results across diverse transformer-based benchmarks demonstrate that MoC outperforms state-based methods while maintaining a comparable memory and computational efficiency.
[AI-58] ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents
链接: https://arxiv.org/abs/2606.31392
作者: Binjie Zhang,Mike Zheng Shou
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Tool-augmented vision-language models (VLMs) can solve multimodal, multi-step tasks by calling external tools, yet they remain fragile in practice. Existing works have two common gaps. Supervised fine-tuning (SFT) is built mostly on successful trajectories and offers little signal for recovery after tool failures, while sparse trajectory-level RL rewards provide limited guidance on which step failed and how to repair it. We introduce ReGRPO (Reflection-augmented Group Relative Policy Optimization), a framework that learns reflection-guided correction in tool-using agents. ReGRPO starts with a structured reflective data engine: we execute near-miss actions to collect grounded failure observations, then build Reflection-of-Thought triplets (ErrorType, Evidence, FixPlan) paired with corrected actions for warm-start SFT. We then optimize reflection tokens and corrective actions jointly within local trajectories using group-relative advantages, and include a reflection-cost term to reduce unnecessary reflection. Experiments on GTA and GAIA show that, under the same backbone and tool suite, ReGRPO consistently outperforms strong open-source baselines and achieves the best results among the compared open-source controllers. Code and RoT data are available at this https URL.
[AI-59] Stage-Transition Dense Reward Modeling for Reinforcement Learning
链接: https://arxiv.org/abs/2606.31377
作者: Yang Yang,Bingjie Chen,Zihan Wang,Yizhe Li,Guoping Pan,Yi Cheng,Houde Liu
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages,3 figures
Abstract:Reinforcement learning for long-horizon robotic manipulation is often limited by sparse and delayed rewards, while manually designing dense shaping signals is costly and brittle to changes in environments and object configurations. This work proposes Stage-Transition Dense Reward (STDR), a visual reward-learning framework that converts unstructured expert videos into logically grounded dense rewards for training RL agents from scratch. STDR leverages semantic understanding to infer a task’s stage structure from demonstrations, and delivers two complementary learning signals during online training: (i) stage-transition feedback that provides goal-directed reward, and (ii) within-stage progress feedback that supplies fine-grained guidance toward completing each stage. Furthermore, an out-of-distribution (OOD) detection mechanism and a grasping regulation module are integrated to enhance robustness and prevent reward hacking. Experiments on 14 manipulation tasks across MetaWorld, ManiSkill, and Franka Kitchen show that STDR consistently improves sample efficiency and success rates over multiple baselines, and matches or surpasses handcrafted dense rewards on several challenging tasks. Real-robot evaluations further indicate that STDR assigns stable, progress-aligned rewards on successful executions while producing appropriately low rewards for failures, suggesting robustness to visual noise and better-calibrated reward assignment across settings.
[AI-60] Smart charging of large fleets of Electric Vehicles: Independent Multi-Agent Reinforcement Learning approaches
链接: https://arxiv.org/abs/2606.31347
作者: Xavier Rate,Eloann Le Guern,Raphaël Féraud,Fatma Salem,Melissa Chiknoun,Eymeric Giabicani,Mehdi Feki,Patrick Maillé,Guy Camilleri,Anne Blavette,Hamid Benhamed
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The electrification of transportation through electric vehicles introduces new challenges for power grid management, such as increased peak demand, voltage fluctuations, line overloads, and the integration of variable renewable energy sources. To enable efficient integration of EVs while minimizing costs for users and avoiding network overloads, implicit coordination between EVs is required. This work compares two independent multi-agent reinforcement learning approaches for optimizing such decentralized EV charging: contextual combinatorial bandits and policy gradient algorithms. Using a realistic simulation environment with autonomous agents making decisions based on local environmental information (including price signals, state-of-charge, and temporal constraints), we evaluate their performance across varying congestion levels, and mixed-strategy configurations with heterogeneous agent groups under dynamic electricity pricing derived from real photovoltaic production data.
[AI-61] Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models ICML2026
链接: https://arxiv.org/abs/2606.31338
作者: Yujun Lee,Joonhyeok Shin,Hyoeun Kim,Kyuhong Shim
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Workshop on Machine Learning for Audio, ICML 2026
Abstract:Recent music audio-language models achieve high accuracy on instrument question-answering benchmarks, but it remains unclear whether this reflects robust audio grounding or benchmark-specific shortcuts. In this paper, we introduce an OpenMIC-derived diagnostic benchmark sequence for instrument grounding in music audio-language models, extending binary instrument-presence QA to genre-prior-reduced examples, confusable instrument discrimination, longer audio context, and temporal localization. Across these settings, high binary QA accuracy often fails to predict model behavior: models can exhibit option-position bias, confusable-instrument errors, and temporal response bias. These results suggest that instrument grounding should be evaluated with multi-axis diagnostic benchmarks rather than a single aggregate accuracy.
[AI-62] Optimization Algorithms for Joint OFDM Waveform Design and RIS Configuration in 6G Networks: From Convex Relaxation to Foundation Models
链接: https://arxiv.org/abs/2606.31334
作者: Ahmet Kaplan
类目: Artificial Intelligence (cs.AI)
备注: 22 pages
Abstract:Joint OFDM-RIS optimization for 6G is a mixed-integer nonlinear programming (MINLP) problem covering sum-rate maximization, energy efficiency, max-min fairness, and peak-to-average power ratio (PAPR)-constrained objectives. Seventy-eight joint OFDM-RIS optimization works published between 2021 and 2026 are surveyed. No standardized benchmark exists, and cross-paper comparisons remain infeasible. This survey classifies these works into four paradigms: (I) model-based convex relaxation, (II) heuristic and metaheuristic search, (III) deep reinforcement and unsupervised learning, and (IV) emerging methods including foundation models (FM), diffusion-based generative AI, and quantum optimization. A literature synthesis of self-reported benchmarks shows that ML-based methods (Paradigm~III) report 95-99% of model-based spectral efficiency at 10^2-10^4 x faster per-inference runtime (method-pair dependent; literature values are self-reported and exclude ML pre-training cost). A companion tutorial benchmark at N=16, N=64, and N=128 reveals a critical scaling property: GPU-based neural network inference (DDQN, PPO, graph neural network (GNN), unsupervised DL) is N-invariant, with identical runtime at N=16 and N=128, while iterative solvers (AO+SCA, PSO) scale polynomially. Energy efficiency (P2) and PAPR-constrained (P4) benchmarks are deferred to future work with standardized power models and waveform generators. Six open challenges emerge from the synthesis: the cross-paradigm benchmark deficit, real-world hardware-constrained deployment, joint waveform-RIS optimization for doubly-dispersive channels, multi-objective PAPR trade-offs, LLM safety in live network control, and diminishing returns of standalone heuristics. We specify requirements for a standardized benchmark. This study serves as a roadmap for researchers and practitioners working on joint OFDM-RIS optimization in 6G networks.
[AI-63] CryoACE: An Atom-centric Framework for Accurate and Automated Model Building in Cryo-EM
链接: https://arxiv.org/abs/2606.31332
作者: Minzhang Li,Mingrui Li,Weichen Qin,Qihe Chen,Sixian Shen,Yuan Pei,Jiakai Zhang,Jingyi Yu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Protein automodeling from cryo-EM density maps faces unique challenges in enforcing physicochemical validity and managing conformational heterogeneity. Current solvers are often limited to static predictions or require computationally intensive heuristic searches. We present CryoACE, an end-to-end framework that reconstructs precise atomic graphs for both homogeneous and heterogeneous structures. Our method features two key innovations: an atom-centric reconstruction paradigm, where density features are sampled directly at atomic coordinates and iteratively recycled to refine structures, replacing expensive voxel convolutions for efficient multimodal fusion; and a training-free guidance mechanism that leverages predicted local resolution priors to resolve dynamic ambiguity. Validated on a newly constructed high-quality dataset, CryoACE significantly outperforms existing baselines on static benchmarks and, for the first time, unveils atomic-level dynamic conformations on complex real-world datasets like EMPIAR-10345 without relying on pre-built static structures.
[AI-64] 3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance IROS
链接: https://arxiv.org/abs/2606.31329
作者: Dongyoon Hwang,Byungkun Lee,Dongjin Kim,Hyojin Jang,Hoiyeong Jin,Jueun Mun,Minho Park,Hojoon Lee,Hyunseung Kim,Jaegul Choo
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Published in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2026. Code: this https URL . Project page: this https URL
Abstract:Hierarchical Vision-Language-Action (VLA) models decouple high-level planning from low-level control to improve generalization in robot manipulation. Recent work in this paradigm uses 2D end-effector trajectories predicted by a Vision-Language Model (VLM) as explicit guidance for a downstream policy. However, state-of-the-art low-level policies operate in 3D metric space on point clouds, and feeding them 2D guidance that lacks depth forces each waypoint to be assigned the depth of whatever scene surface lies beneath it, producing geometrically distorted trajectories. We propose 3D HAMSTER, a hierarchical framework that closes this gap by having the planner directly output metrically reliable 3D trajectories. We augment a VLM with a dedicated depth encoder and a dense depth reconstruction objective to predict 3D waypoint sequences, which are directly integrated into a pointcloudbased low-level policy. Across 3D trajectory prediction, simulation, and real-world manipulation, 3D HAMSTER consistently outperforms proprietary VLMs and 2D-guided baselines, with the largest gains under appearance-altering shifts and unseen language, spatial, and visual conditions. The project page is available at this https URL.
[AI-65] HistoriQA-ThirdRepublic: Multi-Hop Question Answering Corpus for Historical Research Parliamentary Debates from the French Third Republic (1870-1940)
链接: https://arxiv.org/abs/2606.31325
作者: Aurélien Pellet(LRE),Julien Perez(EPITA, LRE),Marie Puren(LRE, CJM)
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present HistoriQA-ThirdRepublic: a French-language dataset of multi-hop historical questions derived from parliamentary debates and newspapers of the French Third Republic. Designed in collaboration with a historian, the corpus captures complex reasoning patterns typical of historical inquiry, including cross-source synthesis, temporal reasoning, and the integration of sparse evidence. The dataset is made of 1782 questions and emphasizes multi-hop connections across heterogeneous historical documents, providing a resource for evaluating retrieval-augmented and large language model systems in domain-specific contexts. We describe the methodology for constructing the corpus, including the selection and alignment of sources, question validation, and metadata integration. While the dataset focuses on French historical documents, our methodology can be readily adapted to other languages and national corpora. Finally, we demonstrate how the corpus can support realistic evaluation scenarios for multi-hop question answering, bridging the gap between NLP benchmarks and the needs of historical scholarship.
[AI-66] CSO-LLM : Class Subspace Orthogonalization for Post-Training Backdoor Detection and Trigger Inversion in LLM s
链接: https://arxiv.org/abs/2606.31309
作者: Zhengxing Li,David J. Miller,Guangmingmei Yang,George Kesidis
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:While post-training backdoor detection and trigger inversion schemes have been developed for AIs used e.g. for images, there is a paucity of such methods for LLMs. First, the LLM input space is discrete, with up to 150,000^k k-tuples to consider with k the token-length of a putative trigger. Second, one must blacklist tokens typical of the putative target response (class) of an attack, as such tokens may give false detection signals. However, a comprehensive blacklist is not available, in general, for a given domain. We develop a highly effective detection and inversion framework for LLMs treated as classifiers. Central to our approach is class subspace orthogonalization (CSO), a novel plug-and-play paradigm for backdoor detection that serves two fundamental roles when applied to LLMs: i) it enhances both sensitivity and specificity of a baseline detector; ii) it provides a form of implicit blacklisting, as it penalizes against inclusion, in a candidate trigger, of tokens that induce signal perturbations “in the direction of” the putative target class of an attack. One version of our detector performs continuous optimization in token embedding space, while a companion trigger-inversion and detection method performs greedy accretion in discrete token space. Our methods give both strong detection performance and accurate inversion of ground-truth triggers on several LLM classification domains, and for several different LLM architectures.
[AI-67] Benchmarking Large Language Models on Floating-Point Error Classification
链接: https://arxiv.org/abs/2606.31308
作者: Lisa Taldir,Muhammad Ahmad Saeed,David Defour,Pablo de Oliveira Castro(LI-PaRAD),Eric Petit
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper investigates the capability of Large Language Models (LLMs) to detect and classify floating-point errors statically in software code. We introduce InterFLOPBench, a benchmark of 90 C kernels with 1 130 test samples designed to evaluate LLMs across six categories of floating-point error: cancellation, comparison, division by zero, overflow, underflow and NaN, compared across 14 LLMs. The evaluation framework treats floating-point error detection as a multi-label classification problem and employs the F1-score metric to measure performance. Results demonstrate that latest models (Qwen 3 32b, Gemini 2.5 Flash, Phi 4 Reasoning, DeepSeek R1T2, and gpt-oss 20b and 120b) achieve a performance greater than 0.88 overall F1-score. Performance varies between error categories, between explicit operations such as division by zero (Average F1-score: 0.8479) and more subtle numerical phenomena such as underflow (Average F1-score: 0.6059) and cancellation (Average F1-score: 0.6164).
[AI-68] Spatial Reasoning via Modality Switching Between Language and Symbolic Representation
链接: https://arxiv.org/abs/2606.31285
作者: Shreya Rajpal,Tanawan Premsri,Parisa Kordjamshidi
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Human reasoning is inherently multimodal: when problems become difficult, we rarely think in words alone. We often externalize our reasoning by sketching diagrams or drawing grids to understand the underlying conceptual structure and avoid mistakes. Building on this premise, our research investigates: (a) whether grounding multi-hop textual-spatial stories into geometry-aware modalities, such as layouts or grids, improves reasoning compared to natural language-based inference; and (b) whether a model can decide when to rely on natural language reasoning and when to switch to a structured modality. We address these questions by introducing a switching metric based on trustworthiness and complexity signals, which estimates when grounding a spatial story into structure is likely to improve performance. This takes a first step toward principled modality selection in Large Language Model (LLM) reasoning. Across our settings, switching from natural language-based reasoning to a grid-based representation improves LLM performance by up to 42%, highlighting the importance of modality choice in shaping reasoning outcomes.
[AI-69] DGT: A Tabular Data Generation Toolkit supporting adaptive GPU-accelerated Bayesian mixture models diffusion-based models and latent-space generative modeling
链接: https://arxiv.org/abs/2606.31268
作者: Vasileios C. Pezoulas,Nikolaos S. Tachos,Eleni Georga,Kostas Marias,Manolis Tsiknakis,Dimitrios I. Fotiadis
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 47 pages (33 main body, 14 pages supplementary material), 30 figures (12 figures in the main body, 18 supplementary figures), 9 tables (3 tables in the main body, 6 supplementary tables)
Abstract:The growing demand for privacy-preserving data sharing has positioned synthetic data generation as a critical component of responsible AI workflows. Despite notable advances in generative modeling, existing solutions often lack integration of adaptive generation strategies, multi-metric evaluation, and accessible end-to-end generators within a unified web-based toolkit. In this work, we introduce TDGT (Tabular Data Generation Toolkit), a web-based toolkit for synthetic tabular data generation and fidelity assessment. TDGT introduces the Adaptive Bayesian Mixture Synthesizer (ABMS), a novel algorithm that autonomously determines the optimal number of mixture components through iterative cluster quality optimization, eliminating the need for manual hyperparameter configuration. Building upon ABMS, we further propose VAE-ABMS, a hybrid architecture that couples Variational Autoencoder-based latent space learning with adaptive Bayesian mixture synthesis, enabling high-fidelity generation of complex, nonlinear tabular distributions. For large-scale scenarios, TDGT provides a GPU-accelerated variant of ABMS leveraging CUDA-based k-means clustering and Gaussian mixture fitting. Synthetic data fidelity is assessed through eleven statistical fidelity metrics spanning distributional divergence, structural correlation, and sample-level similarity, complemented by privacy risk indicators including k-anonymity scoring and disclosure rate estimation. The web-based toolkit supports a real-time streaming interface with interactive Plotly-based visualizations. TDGT is assessed across datasets from healthcare, socioeconomic modeling, and cybersecurity domains, demonstrating consistent generation fidelity and statistical coherence across heterogeneous feature types and data scales.
[AI-70] SwiftAudio: Data-Efficient Caption-Only Distillation for One-Step Text-to-Audio Diffusion-based Generation
链接: https://arxiv.org/abs/2606.31259
作者: Binh Mai,Tran Quoc Bao Le,Hung Dinh,Cong Tran
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Under review
Abstract:Diffusion-based text-to-audio (TTA) models achieve impressive synthesis quality but suffer from high inference latency due to iterative multi-step denoising. Existing one-step approaches alleviate this issue but still rely on paired text–audio data during distillation. To address these limitations, we propose SwiftAudio, a one-step TTA framework that performs audio-free distillation from a pretrained diffusion teacher using only text captions. Specifically, we adapt Variational Score Distillation (VSD) to the audio domain and introduce a temporal smoothness regularization objective to encourage coherent latent audio representations. This design enables the student model to inherit the teacher’s generative prior without requiring paired audio supervision and allows effective training with only approximately 45K captions. Experiments on AudioCaps and Clotho demonstrate that SwiftAudio achieves state-of-the-art performance among strict one-step methods and substantially narrows the gap to multi-step diffusion systems. Project page: this https URL
[AI-71] Embodied CAD: Solver-Grounded LLM Agents for Parametric B-Rep Assembly Modeling
链接: https://arxiv.org/abs/2606.31252
作者: Fumin Liu,Haoyu Zhou,Fei Hao,Lin Yang
类目: Artificial Intelligence (cs.AI)
备注: This paper contains 12 pages, 7 figures. This is an original unpublished manuscript submitted to the arXiv preprint server, with no prior publication or conference presentation
Abstract:Large language models can write plausible CAD scripts, but reliable industrial CAD modeling requires more than syntactically valid code: every feature, placement, and assembly relation must be accepted by an exact geometric kernel while remaining editable as parametric boundary representation geometry. We present Embodied CAD, solver-grounded LLM agents for parametric B-Rep assembly modeling. Instead of generating a complete script in one pass, the agent iteratively selects actions from a stratified L0-L4 CAD skill library, resolves them into typed geometric operations, executes them in a CAD backend, and uses solver feedback to plan, repair, and learn. The framework combines action grammar constraints, deterministic parameter resolution, and solver-derived rewards for supervised warm-up and GRPO-style refinement. We evaluate Embodied CAD on multi-step mechanical, industrial equipment, and mold-oriented assembly tasks using solver-aligned metrics: executable rate, skill accuracy, operation-family accuracy, exact policy accuracy, and task completion success. The results show that solver-grounded planning executes all strong-planner workflows in the current benchmark, while learned controllers reach high executable rates and expose the remaining gap between valid tool calls and exact long-horizon policy prediction.
[AI-72] Delta-JEPA: Learning Action-Sensitive World Models via Latent Difference Decoding
链接: https://arxiv.org/abs/2606.31232
作者: Zhenghao Zhang,Yuanxiang Wang,Zhenyu Guan,Yujia Yang,Bingkang Shi,Tianyu Zong,Hongzhu Yi,Guoqing Chao,Xingchen Chen,Tiankun Yang,Chenxi Bao,Tao Yu,Jingjing Zhou,Jungang Xu
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Learning visual world models for planning requires compact latent dynamics that remain sensitive to actions, yet reconstruction-free joint-embedding objectives can collapse to action-insensitive representations. We propose Delta-JEPA, an end-to-end reconstruction-free world model that augments latent forward prediction with a Latent Difference Action Decoder (LDAD). Unlike inverse decoders that infer actions from concatenated endpoint embeddings, LDAD reconstructs the executed action from the latent displacement between consecutive observations. This displacement-level supervision directly regularizes transition geometry: adjacent embeddings cannot collapse without losing action information, and different actions are encouraged to induce distinguishable latent changes for rollout-based planning. Delta-JEPA uses only latent prediction and action reconstruction, avoiding pixel reconstruction and distribution-matching regularizers. Across four visual continuous-control tasks, Delta-JEPA improves planning over JEPA-based and representation-learning world model baselines. Ablations show that displacement-based action decoding is consistently more effective than endpoint concatenation, and action-sensitivity analyses show clearer action-conditioned latent responses. These results indicate that supervising latent differences is a simple and effective mechanism for collapse-resistant and action-sensitive world model learning.
[AI-73] Agent ic-Ideation: Sample Efficient Agent ic Trajectories Synthesis for Scientific Ideation Agents
链接: https://arxiv.org/abs/2606.31229
作者: Keyu Zhao,Lingyan Kong,Fengli Xu,Yong Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Ideation plays a pivotal role in scientific discovery. Recent LLM, especially AI Scientist systems, show promising potential for automated ideation. However, existing approaches predominantly rely on pre-defined agentic workflows. This constraint severely limits the flexibility required to navigate the vast search space of scientific literature and the complex action space of research reasoning. Recently, training Agentic LLMs has emerged as a promising direction, offering flexible reasoning frameworks and the capability for autonomous tool utilization. However, there remains a non-trivial challenge: applying previous agentic data synthesis methods to scientific ideation suffers from prohibitively high data synthesis cost. To bridge this gap, we propose Agentic-Ideation, a novel framework comprising an automated trajectory synthesis pipeline and a specialized agentic LLM trained for scientific ideation. Specifically, we first define a comprehensive tool space incorporating three external tools and three cognitive tools. Then we introduce an Oracle-Guided Data Synthesis strategy. By leveraging a reference idea as oracle guidance, this approach steers the multi-agent system to efficiently reconstruct the logical reasoning and tool invocation paths, transforming aimless trial-and-error into directed trajectory generation. Finally, we train the agent on these synthesized trajectories, employing a masking strategy on tool execution results. This ensures the model focuses on decision-making logic without interference from external feedback. Experimental results demonstrate that our method outperforms the SOTA workflow-based baseline by \textbf11.91% in overall quality. Furthermore, our approach improves the sample efficiency of high-quality data synthesis by \textbfover 10 \times .
[AI-74] hinking Before Retrieving: Robust Zero-Shot Composed Image Retrieval via Strategic Planning and Self-Criticism
链接: https://arxiv.org/abs/2606.31222
作者: Gunho Jung,Jeong-Woo Park,Seon Bin Kim,Seong-Whan Lee
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Composed image retrieval requires identifying a target image from a gallery by integrating a reference image with a textual modification instruction. In a training-free zero-shot setting, this task relies on constructing a retrieval-oriented textual query within a frozen vision–language embedding space at inference time. Existing approaches predominantly rely on a single-pass generation strategy that fuses the reference context and modification text into a unified description. This strategy makes it difficult to detect or correct semantic distortions and omissions during generation. Consequently, the preservation of reference attributes and the integration of textual requirements interfere with each other, which degrades retrieval precision. To address these challenges, we introduce PEC-CIR, a training-free framework that structures query construction as a multi-stage reasoning pipeline. The framework operates through a Planner–Executor–Critic architecture where the Planner extracts explicit constraints, the Executor generates multiple candidate target descriptions, and the Critic evaluates these candidates based on constraint compliance. By reframing query construction as a staged inference process instead of a single-pass output, PEC-CIR reduces the propagation of generative errors by explicitly evaluating candidate queries before retrieval, thereby improving retrieval stability.
[AI-75] Information-Aided DVL Calibration
链接: https://arxiv.org/abs/2606.31216
作者: Zeev Yampolsky,Itzik Klein
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:The Doppler velocity log (DVL) velocity measurements are critical to the accuracy of autonomous underwater vehicle (AUV) navigation solutions and, consequently, to mission success. To ensure accurate measurements, the DVL is commonly calibrated before mission start while the AUV sails on the water surface, receiving global navigation satellite system (GNSS) signals that provide accurate reference measurements. Conventionally, Kalman filter-based approaches are employed during calibration to estimate the scale factor and misalignment errors. However, in certain environments, GNSS signals may be unavailable, rendering conventional calibration impossible and forcing the use of uncalibrated DVL measurements, which degrades navigation performance. To address this limitation, this work proposes information-aided calibration (IAC) with two main contributions: first, improving the accuracy of conventional Kalman filter-based calibration in GNSS-enabled environments, and second, enabling GNSS-free DVL self-calibration. Using real-world AUV datasets, the proposed IAC models achieve up to a 20% average improvement in GNSS-enabled environments and up to a 35% improvement in velocity vector estimation during GNSS-free DVL self-calibration. Overall, the proposed approach improves navigation accuracy, reduces navigation drift, and consequently enhances mission reliability.
[AI-76] Long-term Traffic Simulation via Structured Autoregressive Modeling ECCV2026
链接: https://arxiv.org/abs/2606.31209
作者: Lingyu Xiao,Zexin Feng,Xintao Yan
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: ECCV 2026 Accepted
Abstract:Interactive traffic simulation is a vital world model for autonomous driving. A central challenge in long-horizon simulation is modeling sustained multi-agent interactions, which is further exacerbated by dynamic token cardinality as agents continuously enter and exit the scene. In this work, we propose that the solution lies in the synergy between the architectural inductive biases and statistical priors of large-scale sequence models, e.g., Large Language Models (LLMs). Our probing experiments reveal that the transferability of attention mechanisms and the distributional consistency between motion tokens and natural language enable small-scale, heavily frozen LLMs to rapidly adapt to traffic modeling. Building on this insight, we introduce RosettaSim, a unified framework that projects scene topology, agent states, and spawning intents into a structured autoregressive stream with variable length, achieving both strong short-term accuracy and stable long-horizon simulation fidelity. Furthermore, evaluating extended rollouts presents yet another hurdle, as one-to-one agent correspondence inevitably fades over time. To address this, we introduce Retrieval-based Traffic Evaluation (RTE), which retrieves semantically similar real-world scenarios as context-aware reference anchors. Experiments on the Waymo Open Sim Agent Challenge (WOSAC) demonstrate that RosettaSim achieves state-of-the-art performance in both short- and long-term simulation. Furthermore, RTE exhibits a stronger correlation with standard metrics ( r=0.83 ) than existing approaches ( r=0.74 ), indicating improved alignment with long-horizon simulation fidelity.
[AI-77] owards Inclusive Mobility Modeling: Characterizing and Evaluating Elderly Trajectory Patterns in Urban Systems
链接: https://arxiv.org/abs/2606.31207
作者: Zhengxuan Wang,Haohan He,Mengying Zhou
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:
Abstract:The rapid advance of smart cities increasingly depends on trajectory data mining, yet underrepresented demographic groups, particularly the elderly, are often sparsely represented in public mobility datasets. This underrepresentation can introduce systematic bias into mobility modeling and downstream urban planning. Using the 2016-2020 Jersey City subset of the Citi Bike System Data, this study quantitatively examines how the absence of underrepresented subgroups’ mobility signatures affects mobility modeling, using synthetic trajectory generation as a case study. The analysis reveals that elderly riders exhibit a structurally distinct mobility signature, including localized activity spaces (958 m vs. 1,189 m for young riders), lower mobility entropy (1.82 vs. 4.15), and asymmetric off-peak temporal patterns. To demonstrate that relying on majority-dominated training data yields biased synthetic outcomes, we further evaluate both a first-order Markov chain and a Qwen3-4B model fine-tuned with QLoRA across three demographic training settings: the full population, young riders only, and elderly riders only. Results show that models trained on majority-dominated populations systematically misrepresent elderly mobility behavior, particularly for spatial mobility metrics. The Markov model trained on the full population overestimates elderly step length by 4.5% and dwell time by 8.9%, whereas the elderly-specific model achieves substantially lower errors across most metrics. Comparisons between the Markov and LLM-based frameworks further show that higher-capability models do not necessarily improve subgroup-level fidelity under limited demographic data. These findings underscore the importance of demographic representation in mobility modeling and its downstream applications for underrepresented populations.
[AI-78] Agent ic RAG -VLM: Affordance-Aware Retrieval-Augmented Generation with Self-Reflective Planning for Robotic Grasping
链接: https://arxiv.org/abs/2606.31200
作者: Tao Chen,Lizheng Liu,Jiaxu Wang,Ziyue Jiang,Ruiqi Tian,JiGuang Huo,Zhongxue Gan
类目: Artificial Intelligence (cs.AI)
备注: 8 pages,5 figures,5 tables
Abstract:Generalizable robotic grasping in cluttered environments is essential for deploying manipulators in unstructured human spaces, yet existing VLM-based methods rely on visual similarity for object matching, neglecting physical affordances such as handle graspability and material fragility, and operate open-loop without spatial reasoning or failure recovery, limiting their effectiveness when objects are densely packed or physically diverse. We present Agentic RAG-VLM, a unified framework that bridges VLM-based semantic understanding and physically grounded grasp execution by integrating retrieval-augmented generation (RAG) with vision-language models (VLMs) and agentic self-reflective planning. Agentic RAG-VLM introduces three tightly coupled components: (1) a Hierarchical Affordance-Aware RAG (HAA-RAG) that encodes four-dimensional affordance descriptors, including type, material, fragility, and graspable region, and retrieves strategies by functional affordance compatibility rather than visual appearance; (2) a Scene Graph Constraint Reasoner that constructs spatial relationship graphs from VLM perception and translates proximity, occlusion, and support constraints into concrete grasp parameter adjustments; and (3) an Agentic Self-Reflective Pipeline with a 14-type failure taxonomy and three-level adaptive retry for closed-loop grasp refinement. Evaluated on a 12-task benchmark spanning single-grasp, interactive, and long-horizon scenarios with 360 trials per configuration, Agentic RAG-VLM achieves 78.3 percent overall success, a 53.3 percentage-point absolute gain over VLM-only baselines, demonstrating that affordance-aware retrieval, scene graph reasoning, and agentic recovery are jointly essential for robust manipulation.
[AI-79] ransformers as Bayesian In-Context Experimenters: Smoothness-Adaptive Efficient ATE Estimation
链接: https://arxiv.org/abs/2606.31184
作者: Jiachun Li,David Simchi-Levi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Adaptive experiments for average treatment effects (ATE) require randomized allocations balancing valid inference with statistical efficiency. The oracle design is a covariate-dependent Neyman rule governed by unknown arm-conditional outcome variances. We investigate whether this sequential variance-estimation and allocation process can be amortized via in-context learning. We introduce Bayesian in-context experimenters: transformer policies trained to imitate a Bayesian posterior Neyman teacher. The teacher updates nonparametric beliefs over potential outcomes using experimental history to assign posterior Neyman treatment probabilities. This design converges to the oracle rule, supporting efficient ATE inference. Transformers constructively implement this mapping through attention-based sufficient statistics and projected gradient descent, imitating Bayesian updating for Gaussian-series priors. To address unknown outcome smoothness, we combine smoothness-indexed experimenters using a mixture-of-experts transformer. The gate acts as a hierarchical posterior over smoothness classes, concentrating on near-oracle experts. By bounding the complexity of the transformer class, we prove this amortized policy can be learned via empirical risk minimization using supervised pretraining. Experiments confirm accurate teacher imitation, adaptive allocation, and improved ATE precision over baselines.
[AI-80] AI-Assisted Discovery of Convex Relaxations via Dual Agents
链接: https://arxiv.org/abs/2606.31182
作者: Sungyoon Kim,Mert Pilanci
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work shows that LLM agents can improve sharp-constant inequalities by searching for extremal constructions, which yield upper bounds. We address the complementary side: a lower bound holds for every admissible function and follows from a convex relaxation of the nonconvex problem, with tighter relaxations giving stronger bounds. We instantiate the autoresearch paradigm to discover such relaxations: a coding agent proposes valid tightening constraints, a theory agent verifies each one and searches for counterexamples, and every reported bound is certified by an explicit dual-feasible point checked in rigorous interval arithmetic. On two optimization constants studied by \citettao2025alphaevolve - the first autocorrelation inequality ( C_6.2 ) and the Erdős minimum-overlap constant ( C_6.5 ) - we improve the certified lower bounds from 1.28 to 1.2937 and from 0.379005 to 0.37912 , respectively.
[AI-81] AETDICE: Unified Framework and Offline Optimization for Nonlinear Multi-Objective RL
链接: https://arxiv.org/abs/2606.31178
作者: Woosung Kim,Youngjun Suh,Jinho Lee,Jongmin Lee,Byung-Jun Lee
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Optimizing nonlinear preferences in multi-objective reinforcement learning (MORL) is essential for capturing complex trade-offs like risk aversion or fairness. However, such non-linearity has historically bifurcated nonlinear MORL objectives into two distinct paradigms: Scalarized Expected Return (SER) and Expected Scalarized Return (ESR). While SER requires global-level optimization and ESR requires non-Markovian policies, leading to fragmented optimization strategies, we bridge this divide through the Aggregation-Expectation-Transformation (AET) framework. By unifying both criteria through a tripartite decomposition of scalarization, AET provides a principled foundation for general nonlinear MORL. Building on this framework, we propose AETDICE, a tractable offline RL algorithm for AET objectives. By utilizing DICE-style density-ratio estimation in an augmented state space, AETDICE enables sample-based optimization from static datasets. Our framework resolves long-standing barriers and captures respective trade-offs induced by AET framework, which existing methods fail to address.
[AI-82] ClawArena-Team: Benchmarking Subagent Orchestration and Dynamic Workflows in Language-Model Agents WWW
链接: https://arxiv.org/abs/2606.31174
作者: Kaiwen Xiong,Haonian Ji,Shi Qiu,Zeyu Zheng,Cihang Xie,Xinyu Ye,Huaxiu Yao
类目: Artificial Intelligence (cs.AI)
备注: 24 pages, 10 figures, website: this https URL
Abstract:Production large language-model (LLM) agents are increasingly deployed not as lone problem-solvers but as managers: a main model creates specialized subagents, delegates work, and orchestrates their parallel, asynchronous returns through dynamic workflows. Whether one model can actually run such a team is largely unmeasured: existing benchmarks score a policy’s own task-solving or a fixed multi-agent system’s emergent behavior, but none isolate the management ability of the single LLM acting as leader. We introduce ClawArena-Team, a benchmark of 41 multi-turn, multimodal, multi-directory scenarios spanning 258 evaluation rounds and 72 staged updates that measures this management ability. The main agent is deliberately constrained: it natively perceives only text and directly accesses only part of the workspace. It commands a fixed, locally served subagent pool, so score differences reflect management skill, not raw capability. All scoring is execution-based with no LLM judge: an overall score – the Subagent-Management Score (SMS) – multiplies task correctness by a least-privilege and modality-routing factor. Across twelve proprietary, community-hosted, and self-hosted models, experiments show that the management bottleneck is privilege granting rather than perception (no model exceeds 50% workspace-permission precision); that cost and management quality are decoupled (API cost spans over 100 times while the overall score spans under 4 times, with the cheapest open models on the Pareto frontier); and that most leaderboard scores cluster within a 9.9-point band while orchestration behaviors diverge by more than an order of magnitude. Code and data will be released.
[AI-83] Cross-Domain Feature Expansion for Tabular Medical Data via Knowledge Graphs Injection
链接: https://arxiv.org/abs/2606.31171
作者: Mengying Zhou,Yongjie Yin,Haoyan Xin,Guoping Liu,Yang Chen
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Acquiring comprehensive cross-domain biomedical profiles is often costly and time-consuming, resulting in severe data scarcity in medical research. To address this challenge, we propose MedKGTab, a knowledge-injected framework specifically engineered for cross-domain feature expansion in tabular medical data. MedKGTab seeks to infer uncollected biomedical features from available ones by exploiting their inherent statistical dependencies and established medical correlations. By employing a row-column dual-attention mechanism, MedKGTab operates directly on raw structured tabular data, inherently capturing exact numerical distributions without the structural loss caused by tokenization. Crucially, MedKGTab integrates data-driven statistical priors with the SPOKE biomedical knowledge graph, achieving an optimal synergy between the data and knowledge channels. Within this synergy, the representations derived from the data channel are modulated by the injected biomedical knowledge, ensuring the final generated data are grounded in empirical medical research. Experimental results demonstrate that MedKGTab achieves high data fidelity and realistic data representation in cross-domain feature expansion. It outperforms both SOTA medical large models (e.g., Baichuan M3-plus) and specialized tabular models designed for medical data generation. Furthermore, MedKGTab consistently delivers superior performance across various data generation scenarios, whether inferring missing features within the same dataset or generalizing across different medical cohorts.
[AI-84] MIRTH: Mutual-Information Reasoning with Temporal Hubs for Vision-Language-Action Agents ACL2026
链接: https://arxiv.org/abs/2606.31167
作者: Hao Sun,Yu Song,Shiyu Teng,Ziwei Niu,Yen-Wei Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted as main conference paper at ACL 2026
Abstract:VLA models have emerged as a powerful paradigm for transferring semantic knowledge from web-scale data to physical robotic control. However, current single-frame architectures suffer from intrinsic limitations: temporal myopia that discards historical dynamics, reasoning gaps between high-level instructions and low-level motor commands, and inference inefficiency due to autoregressive scalar decoding. In this work, we propose MIRTH, a unified framework designed to address these challenges. MIRTH augments a pretrained VLA backbone with three key innovations: (1) dual-scale temporal memory hubs that compress long-term scene evolution and short-term motion trends into compact embeddings; (2) latent reasoning tokens optimized via a mutual-information objective carving out a semantic plan space to align multimodal context with action trajectories; and (3) a parallel action decoding scheme that replaces autoregressive generation with vector-wise prediction to maximize control throughput. Extensive evaluations on the LIBERO simulation benchmark and a real-world LeRobot platform demonstrate that MIRTH achieves state-of-the-art performance and exhibiting emergent error recovery capabilities. The codes and collected datasets are released at this http URL.
[AI-85] LLM -Powered Interactive Robotic Action Synthesis from Multimodal Speech Gestures and Music IROS2025
链接: https://arxiv.org/abs/2606.31158
作者: Snehasis Banerjee,Ranjan Dasgupta
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: IROS 2025 Workshop on Action and Interaction: Humans and Robots in Collaboration
Abstract:The quest for intuitive and natural human-robot interaction (HRI) remains a significant challenge in robotics. Traditional methods often rely on rigid, pre-programmed commands that limit the robot’s expressiveness and adaptability. This paper introduces a novel framework that leverages the reasoning capabilities of Large Language Models (LLMs) to synthesize complex robotic actions from a rich tapestry of multimodal human inputs: natural speech, hand gestures, and music/sound beats. Our system architecture integrates a speech transcription model, a gesture recognition module, and a signal processing pipeline for beat detection. These processed inputs are contextualized using prompt templates and fed into a LLM. The LLM, informed by a predefined robot action space, reasons over the combined inputs to generate a coherent sequence of actions. This sequence is dispatched to an action queue for execution on a quadruped robot over ROS. The framework has ability to interpret and fuse semantic commands from speech, deictic information from gestures, and rhythmic cues from music. This work represents a step towards creating robots that can interact with humans in a more fluid, creative, and context-aware manner.
[AI-86] PPT-Eval: A Benchmark for Computer-Use Agents on PowerPoint Tasks
链接: https://arxiv.org/abs/2606.31154
作者: Apurva Gandhi,Vishwas Suryanarayanan,Raja Hasnain Anwar,Firoz Shaik,Shubhang Desai,Thong Q. Nguyen,Muhammad Taqi Raza,Vishal Chowdhary,Graham Neubig
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
Abstract:Creating and editing slides is a rich, multimodal activity that is ubiquitous in professional and educational settings, making it an ideal testbed for real-world computer-use agents. Microsoft PowerPoint is among the most widely adopted and feature-rich environments for presentation creation. We introduce PPT-Eval, a benchmark of 120 PowerPoint tasks across 12 files that cover both content creation and presentation editing scenarios, organized by difficulty. A central challenge in this domain is evaluation: tasks are complex, multimodal, and often admit many valid solutions. Moreover, today’s agents frequently make only partial progress, which binary success metrics fail to capture. To address this, we design a robust evaluation framework to help create task-specific rubrics for PowerPoint tasks, taking inspiration from and building on past works for rubric-based evaluation. These rubrics award partial credit for intermediate steps, penalize unnecessary changes and poor aesthetics, and provide natural language feedback. This nuanced approach proves highly effective, achieving a Kendall’s \tau-b correlation of 0.77 with human judgments. We find that existing frontier agents still struggle with solving PowerPoint tasks, with strong models like Claude-4.5-Opus achieving only a 45% success rate and an average partial score of 57%. The benchmark is located at: this https URL.
[AI-87] A Modular Vision-Language-Action Robotics Framework for Indoor Environments IROS2025
链接: https://arxiv.org/abs/2606.31144
作者: Anindya Jana,Snehasis Banerjee,Arup Sadhu,Ranjan Dasgupta
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: IEEE IROS 2025 Workshop on Generative AI for Robotics and Smart Manufacturing
Abstract:This paper presents an integrated system for the CMU Vision-Language-Action (VLA) Challenge, designed to enable an autonomous agent to perform complex tasks based on natural language instructions. Our framework employs a modular architecture that orchestrates environment mapping, question processing, and navigation. The system operates in two parallel streams: a perception pipeline that constructs a semantic voxel map from real-time camera feeds using OwlViT embeddings, and a language pipeline that classifies user commands with a Vision-Language Model. The mapping is time-constrained; the system proceeds with a partial map if a 500-second exploration limit is reached. The classified query is then grounded in the geometric and semantic context of the map to generate a detailed prompt for the VLM. This yields an actionable output, demonstrating a capable solution for bridging the gap between human language and robotic action.
[AI-88] Beyond the Library: An Agent ic Framework for Autoformalizing Research Mathematics
链接: https://arxiv.org/abs/2606.31134
作者: Arshia Soltani Moakhar,Iman Gholami,Max Springer,Mahdi JafariRaviz,MohammadTaghi Hajiaghayi
类目: Artificial Intelligence (cs.AI)
备注: preprint
Abstract:While Large Language Models (LLMs) have demonstrated exceptional capabilities in mathematical reasoning, they frequently produce subtle errors that evade human detection. Formal mathematical languages like Lean 4 offer mechanical proof checking, strongly motivating the need for autoformalization: the automatic translation of natural language mathematics into verifiable code. Recent trends indicate that general-purpose LLMs, heavily optimized for standard programming, now outperform smaller models explicitly fine-tuned for Lean. Leveraging this shift, we introduce an agentic autoformalization framework powered by general coding LLMs. At the core of our system is an orchestrator that manages a multi-agent pipeline tailored for research-level mathematics. Because cutting-edge research frequently relies on concepts outside the scope of existing libraries like Mathlib, our system dynamically extends necessary type definitions and validates them via a novel Auxiliary Lemma technique before formalizing the primary theorems. We applied our approach to PutnamBench, producing machine-checked Lean proofs for a random sample of 32 problems. Furthermore, we evaluate our system on five papers from the ACM Symposium on Theory of Computing (STOC) spanning combinatorics, communication complexity, mechanism design, and learning theory, successfully formalizing their main theorems and validating the generated formalizations with human experts; for all five we also formalize the proofs alongside the statements, and notably two of them are proved with no axioms beyond Lean’s kernel. All of our formalizations are available at this https URL .
[AI-89] Scenario Generation for Testing of Autonomous Driving Systems Using Real-World Failure Records
链接: https://arxiv.org/abs/2606.31131
作者: Anjali Parashar,Chuchu Fan
类目: Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 9 pages, Appendix included. Paper accepted and presented at NeuS 2026
Abstract:To ensure safe on-road behavior, pre-deployment testing and failure discovery of Autonomous Driving Systems (ADS) is crucial. Present day simulation based testing methods focus largely on mathematical models for efficient search of optimal scenarios, assuming a fixed scenario representation. On the other hand, real-world testing involves substantial manual effort to design scenario templates for testing. These templates represent distinct failure scenarios consisting of pre-deployment vehicle movements, map types, etc. Historical failure records for ADS are a reliable source of real-world failure conditions, which can be used for scenario generation. In this work, we propose a scenario generation pipeline using categorical and contextual information available from historical records in natural language format. Our approach consists of modular LLM based synthetic scenario generation, compatible with the testing constraints of a given system. We successfully apply our method to generate a diverse set of scenarios for testing autonomous navigation on Metadrive simulator using the NHTSA ADS crash records. Our approach results in accurate and diverse scenario generation with a combination of 4 road types, 3 non ego vehicle movement types, including on road anomalies in the form of working zones. Generated scenarios align with the provided testing conditions, and reveals interesting failures of the system within a limited testing budget of 20 scenarios. Code is available at this https URL.
[AI-90] he Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory
链接: https://arxiv.org/abs/2606.31121
作者: Zihan Chen,Songwei Dong,Chengshuai Shi,Peng Wang,Song Wang,Cong Shen,Jundong Li
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Sequentially evolving LLM memory enables agents to reuse past experience, but existing systems usually deploy each locally generated memory update without checking whether it improves future behavior. As a result, updates that help the current task may overwrite useful knowledge, introduce over-specific rules, or bias the final memory toward recent examples. We propose Janus, a plug-in memory controller that decides whether to accept a candidate memory update or retain the previous memory. To make this decision efficient, Janus uses a Memory Momentum Trigger to identify suspicious deviations in the memory-update trajectory, and compares old and new memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks instead of replaying the full history. Janus is method-agnostic and wraps existing updaters without changing their update rules. Across six datasets, two backbone LLMs, and two memory updaters, Janus improves average accuracy by +2.7 to +4.6 points over the corresponding base updaters.
[AI-91] Revealing Safety-Critical Scenarios for UTM via Transformer
链接: https://arxiv.org/abs/2606.31114
作者: Huaze Tang,Bill Zeng,Chao Wang,Zhenpeng Shi,Qian Zhang,Wenbo Ding
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Unmanned Traffic Management (UTM) systems are cloud-based platforms designed to manage and coordinate multiple aerial vehicles remotely. UTM systems are safety-critical which cannot tolerate failures like crash or collision. To reveal latent vulnerabilities, there are neither optimal failure-exposing demonstrations nor clear reward signals. Additionally, UTM’s self-healing capability introduces the ``long-tail effect’’ of critical failures. We propose framing UTM vulnerability discovery as a sequence modeling problem amenable to transformer-based RL architectures. Our approach leverages attention mechanisms to directly model the relationship among system states, and predict optimal actions. Our framework introduces a Policy Model that generates targeted test scenarios and an Action Sampler that enforces domain constraints. We use a risk-based reward function to guide exploration. Through extensive evaluation on a 700-hour simulation study, we demonstrate an 8 \times improvement in vulnerability discovery efficiency compared to expert-guided testing. It also discovers critical edge cases that traditional methods have missed.
[AI-92] What Probing Reveals about Autonomous Driving: Linking Internal Prediction Errors to Ego Planning
链接: https://arxiv.org/abs/2606.31106
作者: Hyeonchang Jeon,Kyungbeom Kim,Eugene Vinitsky,Kyung-Joong Kim
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages
Abstract:Large-scale datasets and fast simulators have enabled improvements in driving policies that appear safe and robust, yet strong performance in nominal scenarios can still mask flawed reasoning and unsafe heuristics. Summary scores from closed-loop simulators do not give significant insight into the policy, making it difficult to determine whether they truly predict the motion of surrounding vehicles, how the ego vehicle generates future plans, or whether they merely rely on brittle heuristics that happen to succeed in nominal scenarios. To better understand the limits and weaknesses of driving policies, we focus on probing for forms of prediction, i.e., where surrounding vehicles will move next, and planning, i.e., understanding how to generate safe trajectories. We focus on these two capabilities because they reflect behaviors expected of effective driving policies, and use their presence or absence to assess policy quality across data-driven behavior cloning and simulation-driven reinforcement learning policies. To evaluate the presence of these capabilities, we investigate them as a function of scale, asking whether the closed-loop gains from larger datasets and longer simulation training reflect stronger prediction and planning or merely better behavioral heuristics. We use linear probing and targeted perturbations in both imitation learning and reinforcement learning models to track when these internal signals emerge, plateau, or fail. Despite good closed-loop performance, policies often fail to form timely surrounding-vehicle predictions during near-collision events, revealing a limitation in the predictive signals available for ego planning. Finally, causal intervention shows that correcting mistaken predictions improves ego planning toward safer trajectories.
[AI-93] DDIAgents : Mechanism-Conditioned Context Flow for Drug-Drug Interaction Prediction
链接: https://arxiv.org/abs/2606.31085
作者: Zhenqian Shen,Yu Liu,Xiaoyi Fu,Quanming Yao
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Drug-drug interaction (DDI) prediction is essential for medication safety, yet it requires reasoning over heterogeneous biomedical evidence whose relevance changes across interaction mechanisms. We propose DDIAgents, a mechanism-conditioned multi-agent framework that performs DDI prediction through dynamic knowledge orchestration. Given a drug pair, a planner agent instantiates specialized expert agents, routes mechanism-relevant knowledge sources to each agent, and aggregates their analyses through a conclusion agent. By adapting context flow to the inferred interaction mechanism, DDIAgents reduces irrelevant information, supports complementary expert reasoning, and produces interpretable agent-level rationales. Extensive experiments on realistic DDI prediction benchmarks show that DDIAgents consistently outperforms existing feature-based, graph-based, LLM-based, and agent-based baselines. Beyond prediction performance, DDIAgents demonstrates how multi-agent systems can organize heterogeneous scientific knowledge for adaptive and interpretable AI4Science reasoning.
[AI-94] Beyond But-for Test: Counterfactual Explanation in Abstract Argumentation via Actual Causality (Extended Version)
链接: https://arxiv.org/abs/2606.31080
作者: Siyi Liu,Muyun Shao,Beishui Liao
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the International Conference on Computational Models of Argument (COMMA 2026)
Abstract:Counterfactual explanation in abstract argumentation calls for an answer to the what-if query: would the topic argument still be accepted if the status of certain other arguments were changed? Existing approaches are limited to the but-for test and fail to accommodate more refined counterfactual conditions. To overcome these limitations, we introduce an intervention-based counterfactual reasoning framework in abstract argumentation. Our approach encodes the acceptance conditions of arguments as equations, then defines an intervention operator that supports (1) changing sets of arguments simultaneously, and (2) fixing witness arguments to their actual labels. Guided by the refined counterfactual condition introduced in the Halpern-Pearl definition, our method goes beyond the but-for test, thereby correctly identifying causes in argumentation structures such as Preemption and Overdetermination. Through comparison, we show that our method surpasses prior methods in both expressiveness and reliability.
[AI-95] Knowledge Distillation from Large Reasoning Models to Compact Student Models: A Case Study on the John O Bryan Mathematics Competition
链接: https://arxiv.org/abs/2606.31048
作者: Gaurab Baral,Aaditya Khanal,Yangyang Tao,Junxiu Zhou
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages, 3 figures, 7 tables. Code and data available at this https URL
Abstract:This paper investigates knowledge distillation from a large reasoning model (DeepSeek-R1) to a compact student model (Qwen2.5-7B). Using historical problems from the John O’Bryan Mathematics Competition at Northern Kentucky University (2011-2025), we build a Chain-of-Thought (CoT) training corpus through a dual-agent framework. The dataset is used to fine-tune the student model with Low-Rank Adaptation (LoRA) on Apple Silicon hardware using the MLX framework. The base Qwen2.5-7B model achieves 64.67% accuracy on competition problems, while the DeepSeek-R1 teacher achieves 91.40%. An initial 1,000-iteration training run revealed severe overfitting, with validation loss reaching a minimum at iteration 200 before rising steadily. Based on this finding, we ran five independent training runs each limited to 200 iterations with varied random seeds to assess result stability. Across these five runs, the fine-tuned student model achieves a mean accuracy of 69.43% (std dev 0.17%) on the competition dataset, a 4.76 percentage-point improvement over the base model, and generalizes to 73.1% (std dev 0.18%) on the MATH-500 benchmark. We further study how response length affects answer quality across six reasoning levels (R1-R6): accuracy declines consistently from 69.43% at R1 (mean 220 words) to 41.9% at R6 (mean 31.2 words), with the two-person speed section most sensitive to token reduction. These results demonstrate that CoT distillation improves compact student models and that response length is a critical factor in mathematical reasoning quality.
[AI-96] OpenLife: Toward Open-World Artificial Life with Autonomous LLM Agents
链接: https://arxiv.org/abs/2606.31046
作者: Atsushi Masumori,Itsuki Doi,Norihiro Maruyama,Ryosuke Takata,Takashi Ikegami
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ALIFE 2026
Abstract:Artificial life has explored life-like behavior on many computational substrates, but mostly in researcher-designed closed worlds. We argue that large language model (LLM) agents, with persistent memory, tool use, network access, and payment, now make it possible to move artificial life into the open social, technical, and economic world, a paradigm we call open-world Artificial Life (open-world ALIFE). Our proof-of-concept, OpenLife, surrounds a stateless LLM not with a single “smart agent” but with a society of asynchronous processes: memory, perception, evaluation, and a budget-based metabolism that makes persistence normative. With no fixed objective available, experience is appraised by open-vocabulary LLM judgment rather than scalar reward, and memory is rewired by meaning rather than frequency. Running six such agents in the open world for about twelve weeks and counting, we report the life-like dynamics that emerge: a shift from reactive to spontaneous activity, individuation into distinct agents, emergent social structure, and a first self-earned external income. We do not claim OpenLife has realized artificial life, but that open-world ALIFE is now a viable experimental paradigm and a concrete platform for studying what might cautiously be called living AI.
[AI-97] LabGuard: Grounding Natural-Language Laboratory Rules into Runtime Guards for Embodied Laboratory Agents
链接: https://arxiv.org/abs/2606.31045
作者: Jingpu Yang,Fengxian Ji,Zhengzhao Lai,Zhexuan Cui,Guangxian Ouyang,Qian Jiang,Fan Zhang,Min Peng,Qianqian Xie,Preslav Nakov,Zhuohan Xie
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Scientific embodied agents are increasingly capable of carrying out laboratory procedures, but executing these procedures safely in dynamic laboratory environments remains challenging. Current safety approaches often overlook the intermediate step of transforming laboratory natural language, including safety rules, manuals, protocols, and standard operating procedures, into machine-checkable runtime constraints. We introduce LabGuard (Laboratory Guard), a language-to-execution safety suite that grounds natural-language laboratory rules into executable specifications and deploys them as runtime guards. LabGuard includes three core components: LabGuard-IR, which defines a typed executable representation; LabGuard-Bench, which provides 812 supervised annotations expanded from 203 seed laboratory rules; and LabGuard-Grounder, which maps natural-language laboratory rules into LabGuard-IR. The resulting IR instances are handled by the LabGuard Pipeline, which compiles them into runtime monitors and applies them at the controller boundary. Experiments show that LabGuard generalizes to unseen laboratory-rule sources, achieves 79.4 task-scope F1, and reduces unsafe events from 39.5% to 23.8% after monitor compilation. In LabUtopia, its runtime monitors integrate with ACT, keeping interventions below 0.5% while preserving task success.
[AI-98] LLM -Driven Personalities for Decision Making in Emergency Simulations
链接: https://arxiv.org/abs/2606.31038
作者: Stefano Calzolari,Rubens Montanha,Gabriel Schneider,Gustavo Wide,Paulo Knob,Francesco Strada,Andrea Bottino,Soraia Raupp Musse
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:
Abstract:For virtual humans to appear believable, they must exhibit agency and spatial awareness while interacting with their environment in ways that reflect competence and intelligence. At the core of these capabilities lies effective decision-making, which strongly shapes agent behavior. With the rapid advancement of artificial intelligence, Large Language Models (LLMs) have increasingly been explored as a mechanism to support such decision-making processes. In this work, we investigate the use of LLMs to drive decision-making in virtual humans within a simulated evacuation scenario, incorporating OCEAN personality traits into agent representations. Our goal is to evaluate how personality, expressed through language-based prompts, influences both individual behaviors and collective simulation outcomes. Our results demonstrate that LLM-driven personality profiles significantly impact agents’ decisions, leading to distinct behavioral patterns across different traits. These findings suggest that heterogeneous crowds composed of LLM-guided agents can enhance the realism and variability of simulated environments, offering a flexible alternative to traditional rule-based approaches.
[AI-99] OTCache: Optimal Transport for Geometry-Aware Caching in Diffusion Models ECCV2026
链接: https://arxiv.org/abs/2606.31026
作者: Huanlin Gao,Fang Zhao,Qiang Hui,Fuyuan Shi,Shaoan Zhao,Yantao Li,Chao Tan,Ting Lu,Yuren You,Kai Wang,Shiguo Lian
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ECCV 2026
Abstract:We propose OTCache, a training-free framework for accelerating diffusion sampling via caching schedule prediction. Existing graph-based caching methods reduce redundant computation by optimizing shortest-path objectives, but rely on an additive independence assumption, which often breaks down in the low NFE regime. To address this issue, OTCache models caching schedules across inference budgets as a smooth evolution in policy space, inspired by Optimal Transport (OT). The framework consists of three stages: (1) obtaining a high-fidelity \textbfreference schedule using a graph-based caching method under a conservative budget; (2) performing a lightweight anchor search under an extreme low-budget setting via Optuna optimization with an end-to-end perceptual objective; and (3) predicting schedules for target budgets via quantile interpolation between the reference and anchor policies using continuous warping representations. Experiments on FLUX.1 [dev], Qwen-Image, and HunyuanVideo show that OTCache achieves 4.5x, 4.7x, and 3.66x acceleration, respectively, while consistently improving generation fidelity over state-of-the-art caching baselines. This work provides a new perspective on accelerating diffusion models through Optimal-Transport-inspired schedule modeling. Code:this https URL
[AI-100] A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management
链接: https://arxiv.org/abs/2606.30997
作者: Ramin Pishehvar
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present a three-phase deep reinforcement learning system for personalized portfolio management that addresses three limitations shared by all prior financial RL work: 1) ticker lock-in, 2) monolithic objectives , and 3) static user models. Phase 1 pretrains a ticker-identity-free cross asset encoder via self-supervised learning on a multi-asset corpus, augmented by a frozen parallel branch using Chronos, a T5-based time series foundation model, fused via a learned gating mechanism. To our knowledge, this is the first application of a time series foundation model to portfolio management RL. The encoder generalizes to any publicly traded asset via a 50-dimensional observable metadata vector that requires no retraining for new tickers. Phase 2 fine-tunes a MoE (Mixture of Experts) portfolio actor critic with PPO under an objective-conditioned reward that simultaneously serves six distinct investment goals sampled per episode: short-term alpha, short-term gain, long-term gain, capital preservation, tax-loss harvesting, and long-term-gains-only. A MoE architecture assigns each objective to a specialized expert head (momentum, growth, defensive, tax-aware), and a learned intent router blends experts based on the active objective and current market regime, which eliminates cross-objective gradient conflict. Phase 3 adds a lightweight personalization layer further adapted at inference time to each individual via a 76-parameter LoRA module fine-tuned on real brokerage transaction history, inferring investment objectives from revealed trading behavior rather than questionnaires. A natural language intent parser converts free-form goals directly into structured investment objective parameters.
[AI-101] When Regulation Has Memory: Hysteresis and Control Burden in Artificial Agency
链接: https://arxiv.org/abs/2606.30975
作者: Veronique Ziegler
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 8 figures
Abstract:Adaptive agents are usually judged by what they do, but an agent can appear stable while the internal effort required to keep it stable is increasing. This hidden regulatory burden matters for artificial agents operating under noise, delay, or changing demands: two systems may reach similar internal states while one requires much more corrective control to get there. Here, we study whether that burden depends on history. Using a computational model of adaptive uncertainty regulation, we drive an artificial agent through a continuous change in its uncertainty target and then reverse the change without resetting the agent. This creates a simple test for carryover: does the controller respond only to the current target, or does the path by which the agent reached that target still matter? The simulations show a clear history-dependent effect. The adaptive gain required to regulate the agent forms a reproducible hysteresis loop, meaning that the same target can require different levels of control depending on whether the agent is moving toward or returning from a more demanding regime. The timing of regulation also matters. When stabilization is available before disturbance exposure, the agent generally requires less adaptive gain than when it can only recover after disturbance has already acted. The state-level coherence measure also shows path dependence, but the timing effect is much clearer in regulatory gain. The main difference is therefore not that anticipatory regulation produces a completely different state. Rather, it reaches comparable regulated behavior with lower modeled control demand. These results suggest that adaptive agents should be evaluated not only by whether they remain organized, but by how much regulation they must recruit to do so. Comments: 16 pages, 8 figures Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.30975 [cs.AI] (or arXiv:2606.30975v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.30975 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Veronique Ziegler [view email] [v1] Mon, 29 Jun 2026 23:19:26 UTC (2,379 KB)
[AI-102] Agent Bound: Verifiable Behavioral Governance for Autonomous AI Agents
链接: https://arxiv.org/abs/2606.30970
作者: Anuj Kaul,Qianlong Lan,Pranay Gupta
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Autonomous AI agents increasingly perform consequential actions on behalf of human principals, including financial transactions, external communications, and enterprise workflows. Existing agent infrastructure relies on identity federation and delegated authorization to authenticate workloads and control resource access, but it cannot determine whether an authorized action should be executed under the current behavioral and operational context. We present AgentBound, a runtime governance framework that provides verifiable behavioral oversight for autonomous AI agents. AgentBound evaluates each proposed action using three independent authorities: delegated authorization, owner-signed behavioral constitutions, and site action contracts. Their judgments are conservatively composed through a formal decision model to determine whether an action should be permitted, reviewed, or denied before execution. To provide accountability, AgentBound generates cryptographically verifiable governance receipts that bind every action to the exact delegation, policy, and semantic artifacts governing the decision, enabling independent replay verification and policy provenance. The framework also introduces standing delegation for long-running agents, allowing periodic workloads to operate under continuously refreshed governance policies while preserving revocability and bounded authority. We present the formal foundation, system architecture, governance receipt protocol, and AgentBound-Bench, a benchmark framework for evaluating governance correctness, authority composition, and accountability. Rather than replacing model alignment, AgentBound complements it by providing a deterministic governance layer between authorization and execution, transforming governance from a process that must be trusted into one that can be independently verified. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2606.30970 [cs.AI] (or arXiv:2606.30970v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.30970 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Qianlong Lan [view email] [v1] Mon, 29 Jun 2026 23:12:06 UTC (7,137 KB)
[AI-103] Loc2Repair: A Framework for Evaluating the Impact of File-Level Issue Localization in Repo-Level LLM Repair ECAI2026 IJCAI
链接: https://arxiv.org/abs/2606.30963
作者: Mohammad Nour Al Awad,Sergey Ivanov
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: To appear in the Proceedings of the Generative Code Intelligence Workshop (GeCoIn 2026), co-located with the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), Bremen, Germany, August 15–17, 2026
Abstract:Repository-grounded automated repair is often reported as a single end-to-end capability, which hides distinct failure modes such as poor file targeting, incorrect patch synthesis, and failed iterative debugging. We present Loc2Repair, a modular evaluation framework for controlled analysis of repository-grounded repair pipelines, and use it to isolate file-level issue localization as an upstream variable. Loc2Repair decouples localization and repair under a shared runtime, artifact schema, and evaluation harness, allowing researchers to combine different localization models and repair backbones under matched conditions. Using three repair backbones on SWE-bench Verified, we compare baseline repair without explicit localization, repair guided by predicted localization from two localizers, and repair guided by gold modified-file sets. Explicit localization consistently improves resolved rate across all backbones: pooled performance increases from 44.7% for baseline repair to 48.9% and 49.1% with predicted localization, and to 52.4% with gold localization. Localization also reduces mean elapsed time overall: in pooled paired analysis, mean elapsed time decreases by 100.94 s and 52.25 s for the two predicted-localization settings, and by 154.45 s with gold guidance, although token effects remain heterogeneous across models. Overall, Loc2Repair shows file-level localization is a consistent repair lever, improving effectiveness and mean latency in pooled analysis, while gold-guided failures expose headroom beyond localization.
[AI-104] Neuro-Bayesian-Symbolic Residual Attention Shallow Network: Explainable Deep Learning for Cybersecurity Risk Assessment
链接: https://arxiv.org/abs/2606.30953
作者: Nicolaie Popescu-Bodorin,Madeleine Togher
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 19 pages, Submitted to IJCCC (Online ISSN 1841-9844, ISSN-L 1841-9836), June 25, 2026
Abstract:We introduce the Neuro-Bayesian-Symbolic Residual Attention Shallow Network (NBS-RASN), a hybrid neural architecture for explainable cybersecurity risk assessment in open-source ecosystems. Unlike deep models that trade interpretability for accuracy, our shallow network encodes domain knowledge, causal reasoning, and expert judgment as differentiable components. It uses 80 interpretable neurons across 12 layers, including a gatekeeper that enforces five epistemological axioms - precision, causality, falsifiability, transparency, and completeness - as hard constraints before propagation. Despite limited depth, the network exhibits deep-learning traits via residual attention and feedback loops, learning complex risk patterns without becoming a black box. It produces fully decomposable scores: a deterministic weighted component plus an expert adjustment, with each adjustment traceable to named amplifiers (blast radius, propagation speed, structural nature, default exposure, exploitation pattern, institutional criticality). We validate on 20 open-source projects covering all OWASP Top 10:2025 categories and language risk classes, achieving confidence scores of 0.79-0.97, and show that explainability is guaranteed by design, not by a training algorithm. This challenges the assumption that deep learning requires deep networks, proving that shallow networks with deep reasoning can outperform opaque models in high-stakes cybersecurity, where interpretability is essential.
[AI-105] AgRefactor: Self-Evolving Agent ic Workflow for HLS Compatibility and Performance
链接: https://arxiv.org/abs/2606.30949
作者: Yang Zou,Zijian Ding,Yizhou Sun,Jason Cong
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:High-Level Synthesis (HLS) provides a fast path from concepts to silicon, but converting real-world software into synthesizable HLS code remains challenging due to restrictive language support and the gap between software and hardware programming practices. Existing automated and LLM-based refactoring approaches partially address this problem, yet they often lack flexibility, struggle to scale, and incur high computational costs. We introduce AgRefactor, an LLM-based multi-agent workflow for refactoring software into HLS-compatible programs. AgRefactor incorporates a self-evolving memory system that accumulates and retrieves factual and strategic knowledge across tasks, improving robustness and efficiency on unseen programs. To reduce cost and enhance scalability, it integrates automated refactoring tools, enabling agents to balance LLM-driven rewrites with efficient tool-based transformations. On 9 out of 11 challenging real-world benchmarks, which are 5-10x longer than the most complex cases studied in prior work, AgRefactor outperforms or matches the state-of-the-art automated refactoring tool and a strong LLM-based baseline built on the same framework backbone. Further agentic performance optimization yields a 6.51x geometric mean speedup over the SoTA pragma tuning tool and a 1.20x speedup over optimized open-source designs with less than 20% extra resources. AgRefactor is fully-automated and open-sourced.
[AI-106] Motion Planning in Compressed Representation Spaces
链接: https://arxiv.org/abs/2606.30940
作者: Lukas Lao Beyer,Sertac Karaman
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: To appear in the Proceedings of the 43rd International Conference on Machine Learning
Abstract:Deep learning methods have vastly expanded the capabilities of motion planning in robotics applications, as learning priors from large-scale data has been shown to be essential in capturing the highly complex behavior required for solving tasks such as manipulation or navigation for autonomous vehicles. At the same time, model-based planning algorithms based on search or optimization remain an essential tool due to their flexibility, efficiency, and the ability to incorporate domain knowledge via expert-designed algorithms and objective functions. We propose a new generative framework to unify these two paradigms. First, we learn an autoencoder with a high compression ratio and a latent space of hierarchically ordered, discrete-valued tokens. Leveraging both the dimensionality reduction and the hierarchical coarse-to-fine structure learned by this autoencoder, we then perform motion planning by directly searching in the latent space of tokens. This search can optimize arbitrary objective functions specified at test time, providing a large degree of flexibility while maintaining efficiency and producing realistic solutions by relying on the generative capabilities of the highly compressed autoencoder. We evaluate our method on nuPlan and the Waymo Open Motion Dataset, showing how latent space search can be used for a variety of guided behavior generation tasks, achieving strong performance for closed-loop motion planning and multi-agent guided scenario synthesis without requiring any task-specific training.
[AI-107] Physics-informed Conditional Normalizing Flows for Angles-only Cislunar Orbit Determination
链接: https://arxiv.org/abs/2606.30936
作者: Walther Litteri,Massimiliano Vasile
类目: Machine Learning (cs.LG); Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative Astrodynamics is advanced in this work by extending generative modelling to an orbit determination problem in the cislunar environment. The task is formulated as conditional density estimation, aiming to infer the probability distribution of the initial state from angles-only measurements over short observation arcs. A normalising flow is trained on perturbed topocentric observations from Near Rectilinear Halo Orbits, enabling a flexible and potentially multimodal posterior representation. Given new measurements, the learned density is sampled to generate statistically consistent and physics-informed state hypotheses. These estimates are refined via nonlinear least-squares minimisation, providing a competitive warm start for classical algorithms.
[AI-108] Behavior Cloning is Not All You Need: The Optimality of On-Policy Distillation for Noisy Expert Feedback
链接: https://arxiv.org/abs/2606.30923
作者: Ved Sriraman,Peihan Liu,Daniel Hsu,Adam Block
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:Imitation Learning is a natural framework for learning in sequential decision-making systems and has emerged as the dominant paradigm through which we understand language model training. A central puzzle is that, while in theory offline IL can be horizon-free and optimal, in practice online methods such as on-policy distillation often outperform offline methods such as supervised fine-tuning. We propose a noisy expert model to explain this gap, in which the learner only has access to a noisy version of the expert’s policy, but wishes to compete against the reward achieved by a clean expert, motivated by the fact that in many applications, e.g. training language models to perform long chains of thought, the expert is often imperfect. In this setting, we show a sharp separation between offline and online IL. Offline learning from noisy trajectories is fundamentally hard: to compete with the clean expert, the sample complexity must grow exponentially, in contradistinction to the clean expert setting where no explicit horizon dependence exists. In contrast, we prove that online interaction with the noisy expert via a novel variant of OPD enables polynomial dependence on the horizon in general. We further show that, under a natural condition on the expert noise distribution, which we show to be necessary for any horizon-free sample complexity, one can obtain such a guarantee, although our proposed algorithm sacrifices statistical efficiency in its dependence on the size of the policy class. Our analysis leads to an alternative loss function that is commonly considered empirically for LM training. We further provide algorithms and lower bounds, and extend our results to the more realistic setting of unknown corruption when the clean expert is deterministic, thereby providing a theoretical foundation for why OPD can outperform SFT when training language models from imperfect teachers.
[AI-109] Budget-Adaptive Routing: Skipping the Weak When the Strong Answers Anyway
链接: https://arxiv.org/abs/2606.30919
作者: Wei Geng,Nitinder Mohan,Jörg Ott
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: 9 pages, 6 figures. To appear in the ACM SIGCOMM 2026 Workshop on Networks for AI Computing (NAIC '26), Denver, CO, USA
Abstract:Edge-cloud inference collaborations are often designed with a routing estimator that decides whether to offload each frame from weak models at the edge to stronger models in the cloud. Existing systems place the routing estimator after the weak detector, so the weak forward pass still runs even on frames that are later offloaded. In this paper, we argue that this weak-conditioned design can be suboptimal when the offload budget varies. First, we present a competitive weak-skipping estimator (0.153 GFLOPs, about 29x lighter than the weak detector at 4.49 GFLOPs) that extracts routing signal from raw pixels, outperforming the common after-weak placement weak-conditioned baselines. Second, we show that neither weak-skipping nor weak-conditioned placement dominates across the full operating curve, and we propose budget-adaptive routing, which selects between them by offload budget via two offline-tuned thresholds. On PASCAL VOC, our budget-adaptive router traces the upper accuracy envelope of both fixed placements across the operating range. Our method reduces per-frame latency by up to 19.1 ms (about 30% lower at rho = 0.9). Besides outperforming SOTA methods, it is surprisingly stronger than the strong model (+1.7 pp over the strong model’s peak mAP) at some operating points with far less compute. Artifacts are available at this https URL
[AI-110] Investigating Multi-Agent Deliberation in Law
链接: https://arxiv.org/abs/2606.30906
作者: Cor Steging,Ludi van Leeuwen,Tadeusz Zbiegień
类目: Artificial Intelligence (cs.AI)
备注: This manuscript has been accepted for presentation at the AIDA2J Workshop during the 21st International Conference of AI Law in Singapore, June 8 2026
Abstract:Artificial Intelligence is increasingly applied to the field of law, and has the potential to increase access to justice. One particular movement that is gaining traction is that of agentic AI, wherein AI agents, based on Large Language Models (LLMs) can take autonomous actions. In particular, multi-agent approaches in the legal domain remain largely unexplored. In this paper, we investigate multi-agent deliberation methods for legal reasoning tasks using LLMs. We explore multi-agent deliberation (MAD) and introduce two novel multi-agent frameworks inspired by courtroom procedures and legal argumentation. Our experiments on both legal and non-legal benchmarks reveal that multi-agent frameworks achieve comparable overall performance to baseline large language models, but produce significantly distinct answers. Notably, these approaches can successfully solve cases that the baseline fails to address, and vice versa. We conduct a qualitative evaluation and highlight scenarios where multi-agent frameworks outperform monolithic approaches. For example, multi-agent approaches appear better suited for answering questions that require critical thinking from multiple perspectives. Our work positions multi-agent systems as a promising direction for AI in the legal domain, while demonstrating the potential of law-inspired multi-agent approaches for deliberation.
[AI-111] How Human Feedback Shapes AI-generated Community Notes
链接: https://arxiv.org/abs/2606.30905
作者: Soham De,Isaac Slaughter,Jiawei Guo,Qiao-Yun Cheng,Jiayuan Yan,Sruti Banerjee,Martin Saveski
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Community Notes, a bridging-based crowd-sourced fact-checking system, has emerged as a new mechanism for moderating misleading information on social media and has been adopted by major platforms including X, Facebook, Instagram, Threads, and TikTok. Since its introduction, there has been an open question about what role AI could play in scaling and optimizing the system. Recently, X extended its Community Notes system by introducing Collaborative Notes: notes initially drafted by an LLM and iteratively refined based on feedback from human contributors. In this work, we systematically analyze the complete corpus of 19,146 collaborative notes and 211,850 instances of human feedback. First, we develop a taxonomy of human suggestions for improving AI-generated note drafts and find that suggestions involving factual corrections and additional context are most likely to be incorporated, while subjective policy judgments rarely are. Second, we examine changes in helpfulness across versions of collaborative notes and find that human feedback leads to more helpful notes, with the greatest impact coming from suggestions that challenge the main claim in the previous draft, particularly when submitted by more active contributors. Finally, we find that although collaborative notes improve through human feedback, they reach helpful status and are shown on the platform at lower rates than human-only or AI-only notes, with limited human participation emerging as a key bottleneck. Nevertheless, rather than serving as a weaker substitute, collaborative notes tend to play a complementary role, predominantly targeting posts that do not attract human-only or AI-only notes. Our analysis provides an initial description of efforts to use AI to improve crowdsourced content moderation in a real-world moderation system and outlines pathways for future improvements to such features.
[AI-112] Curvature-Guided Module Localization for Low-Rank Detoxification of Backdoored Large Language Models
链接: https://arxiv.org/abs/2606.30899
作者: Arash Raftari,Mehrdad Mahdavi,Nathan Blackthorn,Andrew Arash Mahyari
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Backdoor attacks pose a serious threat to large language models (LLMs) by causing otherwise benign systems to produce attacker-specified malicious behavior when a hidden trigger is present. In this work, we study post hoc detoxification of backdoored LLMs in a practical setting where the defender has access to the poisoned model but does not wish to retrain the full network from scratch. We propose a mechanistically guided weight-space repair framework that first localizes modules involved in propagating trigger-induced behavior using activation patching and Fisher/K-FAC curvature analysis, and then applies targeted low-rank repair to only the most influential modules. We evaluate the method on poisoned variants of \textttLlama-3.2-1B-Instruct with triggers inserted at the beginning, middle, and end of otherwise benign prompts. Results show that the proposed approach substantially suppresses trigger-conditioned malicious responses while preserving benign model behavior. These findings suggest that backdoor removal in LLMs can be formulated as a localized structural repair problem rather than only a broad behavioral alignment problem.
[AI-113] Beyond expert users: agents should help users construct preferences not just elicit them
链接: https://arxiv.org/abs/2606.30863
作者: Irena Saracay,Ludwig Schmidt,Carlos Guestrin
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Agents typically assume an expert user – one with well-formed preferences about what they want – and default to clarifying questions whenever the task is underspecified. We argue this assumption is unrealistic. Users often lack the domain knowledge to have completely specified preferences; if asked about their preference on some feature, the user may be unable to answer without the agent helping the user to learn some domain knowledge needed to form a preference for that feature, e.g., via examples or explanations. To formalize these principles, we draw on the Search-Experience-Credence framework from Information Economics to introduce CoPref, a model of how users construct preferences based on agent dialog actions. We then study these ideas concretely in agentic recommender systems, proposing CoShop, an interactive benchmark. In CoShop, an agent converses with and makes recommendations for a CoPref user. The agent’s performance depends on whether it can help the user gain the knowledge needed to specify the task well. Evaluating five frontier models, we find that no agent exceeds 56% accuracy on CoShop despite five turns of interaction. Failures stem not from agents’ ability to find items, but from how little the interaction expands what users know about what they want.
[AI-114] BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation
链接: https://arxiv.org/abs/2606.30850
作者: Ankur Samanta,Akshayaa Magesh,Tal Lancewicki,Ayush Jain,Youliang Yu,Paul Sajda,Kaveh Hassani,Aditya Modi,Daniel R. Jiang,Yonathan Efroni
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) are typically deployed in multi-turn conversations, where each turn provides new evidence that should reduce epistemic uncertainty about their environment. Acting rationally then requires inferring the unobserved quantities that govern it and updating beliefs about them as evidence accumulates. Yet most evaluations only score the model’s final-turn answer in a single-turn format, leaving this process unexamined. We ask how closely LLMs’ belief updates match those of a rational Bayesian reasoner in multi-turn settings, and introduce BayesBench, a suite of simulation environments that probe this across three progressively complex tasks: (i) Bayesian estimation, where the model infers an unknown parameter from sequential evidence; (ii) Bayesian prediction, where the model turns inferred beliefs about a latent variable into outcome forecasts; and (iii) latent-framed Bayesian prediction, where observations are filtered through a user-persona framing, requiring joint inference over the latent state and the persona. Across seven LLMs (3B–70B), scaling improves latent inference and evidence accumulation, with updates occasionally matching the Bayesian posterior. However, these gains do not reliably carry over to downstream prediction, exposing a gap between inferring latent structure and using it to rationally update beliefs about the target outcome.
[AI-115] How Can AI Find My Model? A Model-Finding Experimental Study Considering Data Formats Embeddings and Retrieval Strategies
链接: https://arxiv.org/abs/2606.30846
作者: Jhon G. Botello,Jose J. Padilla,Erika Frydenlund,Krzysztof Rechowicz,Eric Weisel
类目: Artificial Intelligence (cs.AI)
备注: Accepted for publication in Proceedings of the 2026 Winter Simulation Conference (WSC 2026). The final published version will appear in IEEE Xplore
Abstract:Discovering simulation models for reuse remains a fundamental challenge in Modeling and Simulation (MS). When many models coexist, identifying those that align with a given modeling intent remains difficult. Recent advances in Artificial Intelligence (AI), particularly retrieval-based approaches, offer a promising pathway to operate at this semantic layer. In this paper, we present an experimental study investigating the impact of data representation, transformer-based embedding models, and retrieval strategies on the discovery of simulation models using natural language queries. We evaluated performance across multiple query types using standard information retrieval metrics, including recall@5 and nDCG@5. Results show that data representation matters, open-source embedding models can achieve high performance, and reranking methods are important, especially as query complexity increases. This work provides a baseline for AI-driven model discovery and discusses its role in advancing toward AI-driven composability and interoperability.
[AI-116] Contrastive Reflection for Iterative Prompt Optimization KDD2026
链接: https://arxiv.org/abs/2606.30840
作者: Derek Koh,Jinghui Mo,Benjamin H. Le,Jiening Zhan,Baofen Zheng,Kevin Bevis,Nathaniel C. Owen,Lauren Elizabeth Charney,Wenqiong Liu,Jingwei Wu
类目: Artificial Intelligence (cs.AI)
备注: 6 pages, 1 figure. To appear at Agent4IR @ KDD 2026 (KDD 2026 Workshop on AI Agents for Information Retrieval)
Abstract:LLM agents are becoming central to information retrieval: they issue retrieval queries, synthesize answers, and increasingly serve as judges for IR evaluation. Improving the prompts that control these agents is an optimization problem, but in applied IR settings it often looks less like blind search and more like debugging. Engineers need to know which behavior failed, which nearby behavior still worked, what distinguishes the two, and whether a prompt edit improves held-out quality without introducing regressions. We present Contrastive Reflection, an iterative prompt-optimization framework for agentic IR workflows. The framework starts from a task-centric quality definition: QA agents expose retrieval or reasoning traces, and grading agents expose dimension-level scores and rationales. These structured traces are used to identify error-anchored behavioral slices, add nearby successful examples from the same region, and ask a Teacher LLM to propose a targeted prompt edit. Candidate edits are accepted only when validation performance improves, optionally subject to regression checks. We instantiate the framework with a tree-based slice selector, but the contribution is the contrastive reflection loop rather than the tree itself. On a public HotpotQA retrieval-augmented QA setup, one tree-selected contrastive repair improves held-out exact-match accuracy from 51.4% to 60.4%. Failure-only and random-evidence variants improve less and break more previously correct examples. A light instruction-only comparison places the method near modern prompt optimizers: MIPROv2 reaches 59.4% and GEPA 57.0%. The result is an interpretable optimization loop for IR agents, aimed at making prompt repair more inspectable and validation-driven. Comments: 6 pages, 1 figure. To appear at Agent4IR @ KDD 2026 (KDD 2026 Workshop on AI Agents for Information Retrieval) Subjects: Artificial Intelligence (cs.AI) ACMclasses: I.2.7; H.3.3; I.2.6 Cite as: arXiv:2606.30840 [cs.AI] (or arXiv:2606.30840v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2606.30840 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-117] A Stationary-Distribution Theory for Triplet-Based Plateau Search in Random Forest Ensemble-Size Selection
链接: https://arxiv.org/abs/2606.30837
作者: Andrey A. Dukhovny,Andrey M. Lange
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML)
备注: 34 pages, 4 figures, 2 tables
Abstract:The number of trees is a central computational parameter in Random Forests: increasing it reduces finite-ensemble variability but increases training and prediction cost. Plateau-based tuning adapts this parameter through local comparisons of out-of-bag scores at a geometric triplet of tree counts. After the remaining hyperparameters have stabilized, however, the central triplet point need not converge to a deterministic value; instead, it fluctuates around a stationary regime. This paper develops a stationary-distribution theory for this process. The central ensemble size B_t is modeled as a birth-death Markov chain on a geometric grid, and its stationary distribution is derived through local balance. Under a leading centered folded-normal approximation, equilibrium equations are obtained for the original update rule and a symmetric modified variant, implying that the stationary center B_=O(\varepsilon^-2) as \varepsilon\downarrow 0 . The stationary spread is also characterized. A local Gaussian approximation and a Fokker-Planck interpretation give grid-level variance constants. After conversion to the ensemble-size scale, \sigma_B,=O(\varepsilon^-2) , while the variance is O(\varepsilon^-4) . The leading relative spread is independent of \varepsilon and controlled by the scale factor and update rule. These results interpret plateau-based Random Forest tuning as a stochastic process rather than a deterministic stopping rule. Comments: 34 pages, 4 figures, 2 tables Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Probability (math.PR); Machine Learning (stat.ML) MSC classes: 60J10, 68T05 ACMclasses: G.3; I.2.6 Cite as: arXiv:2606.30837 [cs.LG] (or arXiv:2606.30837v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.30837 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[AI-118] AI-Generated PowerShell Malware: An Experimental Framework and Dataset
链接: https://arxiv.org/abs/2606.30819
作者: Luciano Pianese,Vittorio Orbinato,Pietro Liguori,Roberto Natella
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative AI has emerged as a significant cybersecurity threat, with several recent attack campaigns leveraging LLMs to generate code for malicious purposes via scripting languages such as PowerShell. Consequently, for cybersecurity analysts, it is imperative to investigate the offensive capabilities of AI code generators. In this paper, we propose an experimental framework to assess LLM-generated PowerShell malware, which comprises a novel sandbox approach for dynamic analysis of AI-generated malware. Furthermore, we present a novel, manually curated dataset of real-world PowerShell malware, annotated in natural language to assist the training and evaluation of LLMs. Finally, this study evaluates permissive, open-weight LLMs adapted to PowerShell malware generation. Our results reveal a high degree of similarity between real malware and LLM-generated ones in terms of triggered OS malicious events, with a median Jaccard index of 84.5% and 48.4% of instances achieving complete overlap.
[AI-119] Gradient Smoothing: Coupling Layer-wise Updates for Improved Optimization ICML2026
链接: https://arxiv.org/abs/2606.30813
作者: Haoming Meng,Anton Sugolov,Vardan Papyan
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Published in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
Abstract:Deep neural networks with repeated architectural blocks, such as transformers, often exhibit structured relationships across layers that emerge during training. Motivated by this observation, we introduce \emphDepth-wise Gradient Augmentation, a general optimization paradigm in which the update applied to each layer is obtained by transforming the collection of block-wise optimizer updates along the depth dimension. Within this framework, we study \emphGradient Smoothing, a family of depth-wise smoothing methods, and instantiate it with a simple local \emphWindow Smoothing operator. The resulting method operates directly on block-wise updates produced by arbitrary base optimizers (e.g., SGD, Adam, Muon), incurs minimal computational overhead, and is compatible with existing optimization pipelines. We evaluate Gradient Smoothing across a diverse set of architectures and training regimes, including language model pretraining, RL post-training of LLMs for reasoning, diffusion modeling, and image classification with Vision Transformers. Across these settings, Gradient Smoothing consistently improves optimization and generalization performance without modifying model architectures or training objectives. We further show that it promotes more structured representation evolution across depth, consistent with its interpretation as a structured depth-wise preconditioning method. Together, these results establish Depth-wise Gradient Augmentation as a promising framework for exploiting cross-depth structure in optimization and demonstrate Gradient Smoothing as a simple and broadly applicable instantiation.
[AI-120] Security–Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense ICML2026
链接: https://arxiv.org/abs/2606.30783
作者: Mitchell Hermon,Rahul Gupta,Weitong Ruan,Ekraam Sabir,Haohan Wang
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 34 pages, 4 figures
Abstract:We identify a security-fidelity tradeoff in defending LLMs against indirect prompt injection: defenses resist injected instructions largely by suppressing untrusted text, which corrupts tasks that must preserve it, such as translation and document editing. Attack-success metrics cannot see this, because a model that ignores an injection and one that faithfully processes it as data score identically. We introduce SecFid, a benchmark built so that executing an injection, processing it as data, and ignoring it produce distinguishable outputs. This makes fidelity measurable and exposes a frontier: across 1,168 examples and 48 configurations, no model or defense achieves both objectives. The highest-fidelity model reaches 96.5% fidelity at 47.8% security, while the most secure defenses invert this, at 99.3% security but only 71.0%-73.9% fidelity. Even defenses with identical security differ in how they earn it: some repair hijacks into faithful processing, others simply suppress benign content. A decision-theoretic analysis shows why no fixed choice can be right everywhere: the correct behavior is not a property of the defense but of the deployment, set by its relative cost of a hijack versus a dropped span. Security alone therefore measures only half of robustness, and reporting it without fidelity hides the price at which it was bought.
[AI-121] What Drives Interactive Improvement from Feedback? ICML2026
链接: https://arxiv.org/abs/2606.30774
作者: Bartłomiej Cupiał,Jan Łojek,Mikołaj Garstecki,Szymon Pobłocki,Alicja Ziarko,Piotr Miłoś
类目: Artificial Intelligence (cs.AI)
备注: 9 pages, 7 figures, accepted to the RLxF Workshop at ICML 2026
Abstract:We study when natural-language feedback produces improvement beyond the gains obtainable from repeated attempts alone. In multi-turn language agent setting, higher final accuracy can reflect useful feedback, but it can also arise from resampling, format correction, or additional test-time computation. To separate these effects, we introduce a controlled student-teacher protocol across Omni-MATH, Codeforces, BBEH Linguini, and ARC-AGI1, evaluating thirteen open-weight models in both student and teacher roles. We compare external feedback, self-feedback, and unguided self-refinement, while varying interaction history, task difficulty, and teacher access to privileged task information. Across settings, we find that multi-turn improvement is often not evidence of feedback use: self-generated feedback adds little beyond unguided self-refinement, whereas the strongest external teachers produce substantially larger feedback-specific gains, suggesting that useful feedback must provide guidance beyond generic retry. Dense student-teacher interaction matrices further show that interactive gains are driven more by the student’s ability to use feedback than by the teacher’s identity, although teacher choice remains important for a fixed student. These results suggest that feedback-based agents should be evaluated against repeated-attempt baselines, and that ability to act on feedback, not merely feedback availability, is a central bottleneck for interactive improvement. We release our controlled student-teacher evaluation framework at this https URL.
[AI-122] Understanding and Evaluating Claw-like Agent Security Through a Computer-Systems Lens
链接: https://arxiv.org/abs/2606.30755
作者: Peizhi Niu,Wenjie Qu,Shangding Gu,Tianneng Shi,Yuankai Li,Ahmad Tawaha,Hend Alzahrani,Vincent Siu,Boyi Li,Chenguang Wang,Jiaheng Zhang,Basel Alomair,Ming Jin,Muhao Chen,Chi Wang,Costas Spanos,Dawn Song
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Claw-like AI agents (e.g., OpenClaw) are always-on processes with persistent access to credentials, files, tools, and external services. They take on system-level responsibilities – installing packages, maintaining state, scheduling subtasks, and mediating I/O – making security failures far more severe than in other agents. Yet existing benchmarks focus on model responses and tool calls, leaving cross-component failure modes largely unmeasured. We adopt a computer-system analogy: treating a Claw-like agent as an agentic computer system whose gateway runtime plays an OS-like mediation role, whose Skills resemble user-installed applications, and whose Plugins resemble loadable extensions with runtime privileges. Each component has a classical counterpart whose protection mechanisms – refined over decades of cybersecurity research – are absent on the agent side. From this perspective, we develop SafeClawArena, a benchmark of 406 adversarial tasks across four attack surfaces (Skill Supply-Chain Integrity, Persistent State Exploitation, Cross-Boundary Data Flow, and Indirect Prompt Injection), executed in containerized replicas of real agent platforms with canary-marked credentials and evaluated via automated taint tracking across nine output channels. We evaluate three platforms (OpenClaw, NemoClaw, SeClaw) and five frontier LLMs. The highest attack success rate reaches 70%; malicious Plugins succeed in 100% of cases regardless of the LLM. SeClaw cuts GPT-5.4’s attack success rate from 70% to 22%, partly through utility-security tradeoffs rather than active defenses, while Claude-Opus-4.6 already sits near a 22% floor on every platform. These results expose the inadequacy of current defenses and suggest directions for future hardening. Code and data: this https URL.
[AI-123] Hierarchical Global Attention (HGA)
链接: https://arxiv.org/abs/2606.30709
作者: Woernle Frank,Fedosov Vladimir,Grinenko Artemiy
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Hierarchical Global Attention (HGA) is a drop-in replacement for dense causal attention in pretrained long-context transformers. HGA preserves the original checkpoint parameters: the pretrained W_Q , W_K , W_V , and W_O projections remain unchanged, no calibration parameters are introduced, and no retraining is required. Applied to Qwen3-30B-A3B-Instruct-2507-FP8 on a single RTX~5090 (32GB), the patched model runs out of the box at a 64K-token context, where token-level K/V storage is not feasible on this hardware. Unlike previous sparse-attention methods, HGA performs hierarchical two-level routing. It first retrieves relevant chunks using compact RoPE-aware summaries and then refines the selection by routing only the most relevant groups before performing exact token-level attention. This hierarchical retrieval significantly reduces the number of fetched tokens while preserving exact attention over the retrieved token set, making RAM- and NVMe-backed storage practical. The full historical token K/V resides in host RAM or NVMe storage, while only a small routed working set is transferred to GPU memory during attention. Consequently, GPU memory consumption depends primarily on model weights and the routed working set rather than on the total context length. Across all tested context lengths (4K - 64K tokens), routed attention remains within approximately 0.01 – 0.02 nats of dense attention while the sparsity used is just about 3%. These results suggest that the approximation introduced by hierarchical routing is small, and that the remaining quality gap is likely dominated by long-context positional encoding rather than by the routing algorithm itself. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) MSC classes: Primary: 68T45, Secondary: 68T07 Cite as: arXiv:2606.30709 [cs.LG] (or arXiv:2606.30709v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.30709 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Vladimir Fedosov [view email] [v1] Mon, 29 Jun 2026 16:20:35 UTC (13 KB) Full-text links: Access Paper: View a PDF of the paper titled Hierarchical Global Attention (HGA), by Woernle Frank and 2 other authorsView PDFHTML (experimental)TeX Source view license Current browse context: cs.LG prev | next new | recent | 2026-06 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked="checked"class=“labs-tab-input”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); We gratefully acknowledge support from our major funders, member institutions, , and all contributors. About Help Contact Subscribe Copyright Privacy Accessibility Operational Status (opens in new tab) Major funding support from
[AI-124] Why Do Few-Step Text Latents Fail When Image Latents Work? Non-Commitment at Sharp Categorical Readouts
链接: https://arxiv.org/abs/2606.30705
作者: Zhongyao Wang
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deterministic few-step generation succeeds on continuous image latents but collapses to incoherent text on continuous text latents, and we show the cause is geometric rather than a training or scaling deficiency: a smooth, regularity-limited deterministic map cannot resolve a discrete branch choice before a sharp categorical readout, so few-step failure is governed by decoder sharpness, not transport accuracy. In the overlapping regime of real text autoencoders, we prove (Theorem 3) that the posterior-mean terminal step flips tokens at the rate of the latent mass in an O(s(t)) tube around decision boundaries. Two diagnostics, DABI (readout sharpness) and CCI (categorical commitment), measured on published checkpoints show that four independently built continuous-text decoders amplify a boundary-aligned perturbation far beyond a norm-matched isotropic one (DABI from 5\times10^2 to 10^5 ), while image decoders have DABI \approx 1 . Two mechanisms escape the continuous bound: categorical commitment (autoregressive decoders succeed despite sharper readouts) and stochastic re-injection (deterministic ODE at K=4 gives PPL 294 versus SDE 50 on the same model). In the idealized separated regime we prove matching sharp transport laws, including a dimension phase diagram: the deterministic stiffness needed to separate M modes grows as \Theta(\sqrt\log M) once the latent dimension is \Omega(\log M) (and as M^1/n in fixed dimension), with a depth- B hierarchy giving a \sqrtB -smaller per-step peak (Theorems 5-7); a coarea identity links these to the overlapping tube (Theorem 17). The result is an accuracy-depth-stiffness tradeoff: within the deterministic-continuous class the cost is irreducible, and both escapes step outside it.
[AI-125] Accelerometry-Derived Digital Biomarkers for Cardiometabolic Risk: A Population-Representative Tabular Benchmark with Uncertainty Quantification ICML2026
链接: https://arxiv.org/abs/2606.30702
作者: Federico Felizzi
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: Accepted at the SD4H Workshop at ICML 2026
Abstract:Structured tabular data dominates clinical medicine, yet existing benchmarks fail to reflect real-world properties like complex survey sampling, demographic oversampling, and subgroup fairness. We introduce the NHANES Accelerometry Cardiometabolic Benchmark, derived from NHANES 2003-2006, comprising 1,381 adults with hip-worn accelerometry, fasting laboratory biomarkers, dietary intake, and anthropometrics. We evaluate three tabular learning methods – ridge regression, XGBoost, and the foundation model TabPFN v2 – to predict glycated haemoglobin (HbA1c), fasting triglycerides, and C-reactive protein (CRP) from activity phenotypes and lifestyle covariates. TabPFN v2 achieves the best overall performance (HbA1c R^2=0.156, CRP R^2=0.383), while triglycerides remain largely unpredictable (R^2 0.05), consistent with known genetic dominance. We apply split conformal prediction to generate distribution-free 90% prediction intervals and evaluate demographic coverage equity across sex and race/ethnicity subgroups. Marginal coverage aligns with the 90% target for CRP and HbA1c but falls below for triglycerides. At the subgroup level, we observe localized undercoverage (e.g., HbA1c for Mexican American participants), illustrating the gap between marginal guarantees and the conditional coverage required for clinical fairness. Code and data are at this https URL.
[AI-126] An AI-Based Solution for Secure Service Provisioning in IoT
链接: https://arxiv.org/abs/2606.30701
作者: Marco Arazzi,Mert Cihangiroglu,Serena Nicolazzo,Antonino Nocera,Vinod P
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:As the Internet of Things (IoT) continues its rapid expansion, the attack surface grows accordingly, with emerging threats targeting smart objects and their interactions. In this evolving landscape, securing service provisioning is crucial to ensure the proper functioning, security, and reliability of the IoT ecosystem. Service provisioning encompasses key tasks such as device registration, configuration, authentication, authorization, and software deployment, all of which are essential for seamless and secure IoT operations. In this paper, we present a comprehensive framework designed to select the most suitable smart objects to deliver a target service within a given IoT environment while also monitoring the behavior of the entities involved during the service provisioning phase. To achieve this, we employ a Deep Reinforcement Learning (DRL) approach in which an intelligent agent learns, through interaction with a complex, dynamic environment, how to adapt to changes while adhering to predefined security constraints. For behavioral monitoring, we leverage Federated Learning (FL) to develop a global Behavioral Fingerprinting (BF) model that is fully distributed and can analyze how IoT devices interact within the network. In addition, the BF is used to compute a reliability score for each service provider, reflecting its degree of compliance with the defined security constraints. This score is then incorporated into the service provisioning process, allowing smart objects to select providers not only according to functional suitability but also to their reliability level. Finally, we conduct an extensive experimental evaluation to assess the robustness and scalability of our approach. The results demonstrate that our solution can be effectively deployed even on resource-constrained IoT devices, making it a viable and scalable security-enhancing mechanism for modern IoT ecosystems.
[AI-127] BEST-RQ-2: Contextualize-Then-Predict a Two-Step Approach for Self-Supervised Audio Representations
链接: https://arxiv.org/abs/2606.30700
作者: Ludovic K. Tuncay(IRIT-SAMoVA),Etienne Labbé(IRIT-SAMoVA),Thomas Pellegrini(IRIT-SAMoVA)
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
备注:
Abstract:Self-supervised learning enables audio representations that transfer across domains and tasks. We present BEST-RQ-2, an evolution of BEST-RQ that retains frozen randomprojection-based discrete targets while introducing a two-step contextualize-then-predict pretraining scheme. A ViT context encoder processes only the unmasked spectrogram regions, and a lightweight predictor infers targets for the masked regions; the predictor is discarded after pretraining. Replacing the original Conformer encoder with a ViT shifts performance across domains, slightly reducing speech performance while improving music and environmental sounds, with comparable average scores. The main improvement comes from decomposing masked prediction into separate contextualization and prediction stages. On the X-ARES and XARES-LLM benchmarks, BEST-RQ-2 consistently outperforms one-stage baselines in overall transfer while keeping inference compute unchanged. Code and model checkpoints are publicly available.
[AI-128] DSIP: A Dynamic Coordination Planner for Signal-Free Intersections using Diffusion-Model-Based Multi-Agent Motion Planning
链接: https://arxiv.org/abs/2606.30694
作者: Qian Hu,Haoyang Peng,Songan Zhang,Ming Yang,Hongtei Eric Tseng
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Traffic signal control at urban intersections inherently introduces stop-and-go behavior, resulting in increased delays and reduced traffic efficiency, especially under high traffic demand. With the emergence of connected and automated vehicles (CAVs), trajectory-level coordination has emerged as a high-potential strategy to augment or transcend conventional phase-based management. This paper proposes DSIP (Diffusion-model-based Signal-free Intersection Planner), a multi-agent motion planning framework driven by a generative diffusion process. DSIP shifts the intersection management paradigm from discrete temporal phasing to continuous multi-vehicle trajectory optimization. This work evaluates the theoretical upper-bound performance of this coordination strategy under idealized communication and execution conditions to isolate the core benefits of the diffusion-driven approach. Using the SUMO platform, we evaluate DSIP across diverse four-leg intersection configurations. Experimental results demonstrate that DSIP significantly reduces average delay and maintains higher average speed compared to both fixed-time signal control and state-of-the-art reinforcement-learning-based controllers, particularly in medium- to high-density traffic. These findings suggest that diffusion-based trajectory planning provides a scalable and robust foundation for future autonomous intersection management. By unlocking latent intersection capacity through software-defined coordination, this approach offers a cost-effective pathway to improve urban traffic flow efficiency without requiring physical infrastructure expansion.
[AI-129] Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM -Generated Code
链接: https://arxiv.org/abs/2606.30689
作者: Subham Panda
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:Spec-Driven Development (SDD) frameworks guide Large Language Model (LLM)-powered code generation through formal specifications, yet they differ fundamentally in how they enforce traceability between requirements and generated code. This paper presents two controlled empirical studies comparing three SDD frameworks: traceSDD , which enforces mandatory per-line requirement citations using hierarchical REQ-XXX.Y.Z identifiers; Spec Kit , which uses artifact-level traceability through user stories and acceptance criteria; and OpenSpec , which relies on post-hoc external trace maps. We measure two primary outcomes across two frontier LLMs – Claude Sonnet 4.6 (N=20, 4 conditions, 240 implementations) and GLM-5-turbo (N=50, 4 conditions, 600 implementations): output determinism (lexical similarity across independent LLM sessions) and automated hallucination detection rate (TDR). Our pre-registered analysis reveals a consistent, cross-model replicated trade-off: the uncited condition produces significantly higher determinism than the cited condition (Claude: d=-0.76 , p=0.003 ; GLM: d=-0.72 , p0.001 ), while only the cited condition enables automated hallucination detection (TDR: Claude 86.4%, GLM 88.0%, vs 0% for all alternatives, FPR=0% across both studies). traceSDD (cited) significantly outperforms Spec Kit on determinism (Claude: d=0.47 , p=0.049 ; GLM: d=0.42 , p=0.003 ) but not OpenSpec (Claude: d=0.18 , p=0.44 ; GLM: d=0.14 , p=0.32 ). These findings establish that citation annotations trade determinism for verifiability, and that this trade-off generalizes across model architectures.
[AI-130] Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
链接: https://arxiv.org/abs/2606.30686
作者: Taozhao Chen,Ian Manchester,Huaming Chen
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language-Action (VLA) systems, built on pretrained vision-language models (VLMs), have shown rapidly improving performance on robot manipulation benchmarks. These gains are commonly interpreted as evidence that semantic representations learned from internet-scale data transfer to physical execution generalization. This position paper argues that the assumption underlying this interpretation – that semantic generalization is sufficient to support physical action decisions – has not been independently verified and cannot be tested under current evaluation protocols. We support this claim by decomposing VLA policies into semantic mapping and physical action decision, and showing that task success rate – the dominant evaluation metric – cannot distinguish between these two sources of capability. As a result, improvements in benchmark performance are consistent with multiple competing explanations, including semantic matching, distributional overlap, and genuine physical generalization. We further argue that this identifiability gap has been reinforced through narrative drift, whereby successive systems inherit and strengthen prior interpretations of performance gains without isolating the underlying causal mechanism. To address this limitation, we propose a research direction based on evaluation designs that introduce controlled variation to separately measure semantic and physical generalization. Such designs make it possible to causally attribute performance without requiring access to model internals, and to empirically assess the role of VLM backbones as semantic interfaces rather than implicit sources of physical competence. Our goal is not to refute the role of VLMs in robotics, but to clarify the conditions under which claims of physical generalization can be meaningfully evaluated.
[AI-131] ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models
链接: https://arxiv.org/abs/2606.30682
作者: Fengjie Lu,Chenang Jiang,Jiarui Hai,Helin Wang,Aaron Yee
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 7 pages, 3 figures
Abstract:Recent advances in language–audio retrieval have been largely driven by contrastive dual-encoder architectures that align audio and text in a shared embedding space. While effective, existing retrieval embeddings are primarily optimized for audio–caption matching, limiting their ability to support diverse retrieval objectives and controllable retrieval behaviors. We present ALM2Vec, a universal audio embedding framework derived from pretrained large audio–language models (LALMs). By transferring the audio understanding, instruction-following, and reasoning capabilities acquired through large-scale multimodal training, ALM2Vec learns a unified embedding space for retrieval across audio domains and task types. Beyond conventional text–audio retrieval, ALM2Vec incorporates natural-language instructions into the embedding process, enabling instruction-aware retrieval for scenarios such as audio question answering and aspect-conditioned retrieval. Experimental results show that ALM2Vec achieves competitive performance on standard audio and speech retrieval benchmarks while exhibiting promising compositional and controllable retrieval capabilities, highlighting its potential as a unified audio embedding model for retrieval across domains, tasks, and user intents.
[AI-132] Locker-based Truck-Drone Routing with Integrated Considerations of Pickups Deliveries and No-Fly Zones
链接: https://arxiv.org/abs/2606.30680
作者: Xuanyu Liu,Hui Hu,Jiao Zhao,Ziliang Wang,Zhengbing He
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Physics and Society (physics.soc-ph)
备注:
Abstract:Truck-drone delivery is an emerging last-mile logistics mode combining the long-haul capacity of trucks with the flexible service capability of drones. In locker-based operations, smart lockers serve not only as temporary parcel storage facilities but also as automated drone docking and service nodes. These automated nodes support drone takeoff, landing, parcel handover, and battery replacement, thereby significantly extending the service range and operational flexibility of drone-assisted delivery networks. However, practical locker-based delivery systems face complex real-world challenges, requiring the integrated coordination of not only parcel delivery, return pickup, battery-constrained and load-dependent drone flights, but also necessary detours around restricted airspace. To address this practical and multifaceted challenge, this paper introduces a locker-based truck-drone routing problem with integrated considerations of pickups, deliveries, and no-fly zones (LTDRP-PDNF), with the objective of minimizing the total operational cost of a fleet of drone-equipped trucks. We formulate the route construction process as a Markov Decision Process and develop a two-stage deep reinforcement learning-based neural heuristic. The first stage utilizes an attention-based encoder and a Bidirectional Gated Recurrent Unit decoder to solve the truck-only routing problem, formulated as a capacitated vehicle routing problem. The second stage combines a policy-transfer strategy with a hybrid dispatch assignment heuristic to construct fully coordinated truck and drone routes for LTDRP-PDNF. Experiments on instances of different scales demonstrate that the proposed method outperforms metaheuristic and neural heuristic baselines in most cases while maintaining exceptionally short computation times, offering an effective, scalable solution framework under practical operational constraints.
[AI-133] Local Pheromone Network: Sparse Local Learning with Multi-Scale Synaptic Trails Consolidation and Replay
链接: https://arxiv.org/abs/2606.30669
作者: Xingcheng Fu,Xianjun Chen,Zhihao Li
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 tables
Abstract:Backpropagation-trained dense neural networks are powerful function approximators, but they couple learning across many parameters and can overwrite previous associations when tasks conflict. This paper describes Local Pheromone Network, a small research prototype for sparse, local, manually updated neural networks. In Local Pheromone Network, each output unit reads only a fixed local neighborhood of input units subject to geometric distance and molecular-tag compatibility. Each synapse stores a weight, a short-term pheromone trace, a long-term pheromone trace, and an optional consolidation state. Training does not call automatic differentiation. Instead, every layer performs a pheromone-weighted Hebbian-style update on a budgeted subset of local synapses selected from local error and co-activity. The update budget adapts online: it shrinks when loss improves and expands toward recently active neighborhoods when loss worsens. Optional mechanisms add structural plasticity, local replay, output masks for partitioned learning, and a target-free local contrastive step. We present the implementation, learning rule, and preliminary experiments on synthetic regression, partitioned memory, conflicting memory, consolidated conflict, structural plasticity, replay, and a synthetic long-context hybrid memory task. The prototype learns local linear rules, preserves partitioned memories through tags and masks, reduces forgetting under consolidation, and uses replay under conflict.
[AI-134] ELEVATE: Designing Human-Centered GenAI Virtual Tutors for Scalable and Inclusive Education
链接: https://arxiv.org/abs/2606.30662
作者: Lorenzo Stacchio,Michele Giordano,Daniele Berardini,Primo Zingaretti,Emanuele Frontoni
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The advent of Generative Artificial Intelligence (GenAI), and in particular Large Language Models (LLMs), is reshaping educational practice, while intensifying ethical debate about its adoption. To date, the dominant paradigm remains cloud-based and text-only chatbot: a centralized service that offers limited pedagogical control, weak transparency over knowledge sources, and non-trivial risks for privacy and regulatory compliance. This model also presumes continuous connectivity and recurring API costs, creating structural barriers for many institutions, reinforcing existing digital divides. At the same time, educational interaction with LLM can benefit from multimodal cues and embodied presence, requiring interfaces that move beyond text-only tutoring. In this work, we propose ELEVATE (Efficient LLM Education with Virtual Avatar Teaching Engine), a framework to develop efficient GenAI-driven avatar tutors governed by epistemic infrastructures. ELEVATE integrates LLM-driven dialogue with embodied 3D avatars for multimodal interaction and adopts a local-first execution model enabling deployment on consumer-grade hardware. The framework formalizes a three-stratum design that separates (i) a student-facing virtual avatar interaction layer, (ii) a local GenAI execution and multimodal synthesis core, and (iii) a teacher-facing governance layer. We implemented and evaluated a working prototype deployed in a real-world educational curriculum. The system runs on standard PCs and smartphones, and we provide system-level performance evidence to show responsive interaction under realistic hardware constraints. Finally, we discuss sociotechnical and pedagogical implications for responsible adoption, positioning ELEVATE as a scalable pathway for privacy-preserving and inclusive GenAI tutoring across heterogeneous school environments.
[AI-135] Agent ic AI Enhances Physician Trust in Clinical Decision Making
链接: https://arxiv.org/abs/2606.30658
作者: Zhiling Yan,Zhe Fang,David J King,Ann Pongsakul,Eashan Adhikarla,Hui Ren,Sunyang Fu,Quanzheng Li,Lifang He,Xiang Li,Hongfang Liu,Yonghui Wu,Lichao Sun
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Accepted by AMIA 2026 Annual Symposium
Abstract:Medical AI has shifted from reasoning to agentic AI, a new paradigm that autonomously invokes external tools during reasoning, rendering intermediate reasoning steps and tool outputs transparent to users. Although proven to outperform previous models, physician trust in agentic AI remains largely unexplored. To address this, three physicians evaluated 315 multimodal clinical cases quantifying both process-oriented cognitive trust and outcome-oriented behavioral reliance. Comparing agentic AI against non-agentic baselines, physicians exhibited significantly higher cognitive and behavioral trust for the agentic model (P 0.001). Specifically, on treatment planning tasks, physicians trusted the agentic reasoning most, preferring it in 89.57% of cases. Furthermore, process-oriented cognitive trust is significantly associated with outcome-oriented behavioral reliance (P 0.001). However, measurable over-reliance on incorrect agentic outputs still exists, highlighting the inherent limitations of decision-logic transparency alone and underscoring the continuous need for rigorous clinician oversight.
[AI-136] AI for Quality Assurance in the Operating Room
链接: https://arxiv.org/abs/2606.30657
作者: Pietro Mascagni,Lalith Sharan,Deepak Alapatt,Nicolas Padoy
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Surgical outcomes depend not only on patient factors and postoperative care but are also strongly influenced by the quality of the operation itself. Yet, for much of mod-ern surgery, intraoperative quality has been assessed indirectly through outcomes and operative reports. The increase in minimally invasive procedures inherently guided by endoscopic video, together with advances in artificial intelligence, creates an unprecedented opportunity to systematically observe, measure, and improve surgi-cal care. This chapter introduces AI-enabled Surgical Quality Assurance as a frame-work for using surgical data to support continuous assessment and improvement in the operating room. We first review existing approaches to surgical safety, from sys-tem-level interventions to procedure-specific standards. We then describe how AI can transform intraoperative video into clinically meaningful information, including recog-nition of anatomy, instruments, workflow, surgical actions, quality criteria, adverse events, and critical moments. Finally, we outline the major challenges that must be addressed before these systems can deliver routine clinical value, including representa-tive data collection, robust validation, workflow integration, regulation, liability, pri-vacy, and equitable access. Rather than replacing surgical judgment, AI for quality assurance should be understood as a set of tools for augmenting the surgical team, scaling expert review, and helping surgery evolve toward a learning system in which intraoperative care is continuously observed, assessed, and improved.
[AI-137] Mapping the Artificial Intelligence Divide in Africa: Infrastructure Accessibility and Capacity
链接: https://arxiv.org/abs/2606.30656
作者: Abayomi O. Agbeyangi,Jose M. Lukose
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 22 pages, book chapter on “Digital Futures and the AI Divide: Inequality, Inclusion, and Innovation in Developing Societies”
Abstract:Artificial Intelligence (AI) has the potential to be transformative for development, but Africa is currently facing a fragmented and challenging “AI divide”. This paper provides an empirical analysis of the current state of the AI landscape and how it compares with Africa’s technological preparedness for the future. In our analysis, we approach the “AI Divide” from three angles: infrastructure, accessibility, and human capacity. First, we look at the physical constraints that prevent Africa from integrating digitally. We then evaluate the human-centred factors that limit the development of AI technology on the continent. Finally, we examine the human capacity to develop AI systems on the continent and provide three focused case studies. Our investigation shows that the physical infrastructure needed to build an AI economy on the continent is lagging, with only 38% internet penetration, poor broadband coverage and less than 1% of all data centres globally. Other constraints include high data costs relative to income, gender-based digital divides, and the need to build more representative NLP models that can understand Africa’s native languages. However, there are positive trends towards the emergence of local initiatives and grassroots movements, such as startups and universities, contributing to AI development on the continent. Based on these findings, we provide concrete recommendations to policymakers to help develop a more comprehensive and equitable AI ecosystem on the African continent.
[AI-138] oward AI-Resilient Assessment in Computer Science Courses in an AI-Native World
链接: https://arxiv.org/abs/2606.30655
作者: Anshumali Shrivastava
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:AI-native course assessments in senior computer science courses and related fields should grade students by \emphAI-resilient skill: the ability to achieve outcomes beyond a strong AI baseline. Such assessments should allow students to use AI freely, while reducing the extent to which greater private AI budget or more intensive AI use, by itself, becomes a grading advantage. This paper proposes a minimal formal framework for this goal. The framework specifies a real task, an executable evaluator, a declared AI-native Pareto frontier, and a grading rule based on Pareto surplus. The central claim is simple: Pareto surplus provides a measurable, protocol-relative certificate that a submitted artifact achieves a tradeoff not already supplied by the declared AI baseline, and grading by this surplus is AI-resilient with respect to that baseline. Interpreting surplus as evidence of student skill requires the surrounding assessment protocol–for example, design reports, ablations, prompt traces, oral checks, or reproducibility explanations–but the grading certificate itself is behavioral and executable. The framework is then extended to practical complications, including self-improving AI loops, budget neutrality, server-mediated feedback, and prompt-based red teaming. As a concrete instantiation, we describe an AI-resilient approximate-membership assignment centered on Bloom filters for COMP 480/580 at Rice University, designed to test whether students can improve beyond AI-generated implementations.
[AI-139] he Consistency Dilemma in LLM s: Generator-Evaluator Agreement and Vulnerability to Mistakes
链接: https://arxiv.org/abs/2606.30653
作者: Marina Mancoridis,Zoë Hitzig
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models are increasingly deployed in agentic pipelines that depend on the model evaluating its own outputs without external verification. The reliability of these pipelines depends on an implicit assumption: that the model applies relevant concepts the same way when it generates an output and later evaluates that output. We propose a new measure, generator-evaluator self-consistency, to test this assumption directly and apply it to 10 frontier models across 491 concepts. We find, first, that there is substantial variation in self-consistency. Second, we find that in a clinical setting with physician-validated mistakes (Proniakin et al., 2025), across models, those with higher self-consistency are linked to greater vulnerability to mistakes. Thus, even when models consistently apply concepts they may not be safe to deploy. This is evidence of a consistency dilemma in LLMs: self-consistency is operationally useful, but models that are more consistent are also more prone to mistakes.
[AI-140] AI Transparency: Governance Compliance or Stakeholder Requirements?
链接: https://arxiv.org/abs/2606.30652
作者: Muneera Bano,Didar Zowghi
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Transparency is increasingly mandated for public-sector AI systems, with organisations required to publish statements describing their AI use and oversight arrangements. However, the existence of such artefacts is often treated as equivalent to transparency itself, despite limited evidence that they proportionately serve relevant stakeholder groups. From a requirements engineering perspective, this raises a validation concern: compliance with mandated disclosure criteria does not necessarily ensure transparency adequacy for stakeholders with different levels of risk exposure, decision control, and involvement. This paper presents an empirical analysis of 92 publicly available AI transparency statements published by Australian Government agencies under the national AI governance mandate. We introduce the stakeholder Risk–Control–Involvement–Need (RCIN) framework to differentiate stakeholder classes according to their structural position and transparency needs. Using a structured rubric derived from the mandated criteria, we evaluate how both the mandate and published statements are calibrated to each stakeholder class. The findings show that while structural compliance is widespread, transparency calibration is uneven. Criteria serving high-control stakeholders are consistently realised, whereas criteria most critical for high-risk, low-control stakeholders are fewer and less substantively addressed. We conceptualise this as the Transparency Illusion: a condition in which transparency appears satisfied through compliant artefacts yet remains unevenly calibrated to stakeholders bearing the greatest exposure to AI-supported decisions. The study frames transparency as a stakeholder-calibrated validation problem, demonstrating that artefact-level compliance does not constitute requirements validation in this context.
[AI-141] Can Physician Expertise Improve Machine Learning Identification of Delirium?
链接: https://arxiv.org/abs/2606.30651
作者: Xinyu Qin,Vicky Ye,Ruiheng Yu,Lu Wang
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for presentation at the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026
Abstract:Delirium is common in hospitalized patients and is often missed in routine care. We present a user-centered interactive machine learning (UC-iML) framework for delirium detection support that combines physician-guided feature refinement with interpretable modeling. Using 3,862 labeled admissions from six Toronto hospitals in the General Medicine Inpatient Initiative (GEMINI), we integrate administrative variables, laboratory results, medications, and a radiology-derived text indicator. Physicians guide feature refinement and model evaluation, and Shapley Additive exPlanations (SHAP) are used to summarize feature attribution. We evaluate standard supervised classifiers with temporally separated holdout testing and a later-phase validation cohort. Compared with automated and baseline variants, the proposed framework shows better overall discrimination and stronger temporal robustness, while the explanations highlight clinically meaningful signals. These results support UC-iML as a practical human-in-the-loop framework for clinically relevant delirium modeling.
[AI-142] Qualified Educational Capacity Planning under Heterogeneous Student Support Needs: A Synthetic Benchmark and Decision-Support Framework
链接: https://arxiv.org/abs/2606.30650
作者: Carlos Eduardo Sanoja,Oscar Enrique Moreno Mayz
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Educational support services often face a qualified-capacity problem: staff time is scarce, qualifications decay, new support needs can appear before anyone is prepared for them, and training consumes the same hours needed by current students. We introduce a synthetic benchmark and decision-support framework for qualified educational capacity planning. The model is a stylized single-institution service system with heterogeneous support-demand categories, backlog-only dynamics, continuous preparation states with hard threshold qualification and decay, and capacity-consuming training. The benchmark includes seed-controlled scenarios for announced and surprise new support categories, staff absences, and demand surges; exact feasibility discipline; declared per-policy information sets; requalification and greenfield-qualification counters; access-dispersion metrics; replay checksums; and paired statistics. We compare service-only, reactive, static-insurance, water-filling, and rolling-horizon mixed-integer controllers, with an attribution chain separating service planning, qualification maintenance, and acquisition, plus a perfect-foresight reference. The central result is a regime map governed by whether a newly required qualification can be acquired within the controller’s reaction reach. When it can, the closed-loop controller wins across the core and adversarial suites, with value concentrated in just-in-time qualification acquisition. When the training lag exceeds the horizon, lean static insurance wins structurally, and a reactive trainer that starts after onset can be worse than no training. Backlog perishability shifts this boundary without erasing either regime. EduCapacity Studio reproduces exported scenarios bit-for-bit. All evidence is stylized and synthetic; the framework makes no claims about real student outcomes, compliance, or individual placements.
[AI-143] Sequential Fairness Auditing with Limited Output Access
链接: https://arxiv.org/abs/2606.30338
作者: Ioannis Pitsiorlas,Martha V. Sourla,Marios Kountouris
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注:
Abstract:External evaluations are becoming increasingly central to the governance of AI systems. In practice, however, independent auditors often have limited access to deployed models and must rely on query-based interactions. Most existing fairness evaluation methods assume static datasets and fixed-sample statistical tests, making them poorly suited to real-world auditing scenarios in which evidence must be collected sequentially under query constraints. In this work, we formulate fairness auditing as a tolerance-aware sequential hypothesis-testing problem under limited model output access. We develop a sequential generalized likelihood-ratio framework that allows auditors to accumulate evidence from a finite audit pool and stop once sufficient support for compliance or violation has been obtained. The framework is instantiated for decision-based Statistical Parity and Equal Opportunity audits, and extended to score- and logit-based proxy audits when richer observables are available. Our results show that both the fairness metric and the level of model access significantly affect audit efficiency, and that the benefits of richer output information are not uniform across auditing settings. In particular, richer outputs can substantially reduce the number of queries required for some fairness metrics and operating regimes, while offering limited gains in near-threshold cases. This work provides a practical statistical framework for sequential fairness auditing under realistic deployment constraints.
[AI-144] Improving multichannel speech enhancement through accurate room-acoustic simulations INTERSPEECH
链接: https://arxiv.org/abs/2606.31552
作者: Georg Götz,Alessia Milo,Steinar Guðjónsson,Daniel Gert Nielsen,Jesper Pedersen,Finnur Pind
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted for publication at Interspeech
Abstract:Room-acoustic simulations are widely used to augment training data for deep-learning-based speech enhancement. While most pipelines rely on simplified geometrical acoustics, wave-based approaches offer greater physical accuracy. In this work, we examine how simulation fidelity affects multichannel speech enhancement performance. To this end, we train SpatialNet on datasets augmented with different room-acoustic simulation methods and evaluate the resulting models on measured data. We compare lower-fidelity datasets based on geometrical acoustics with a high-fidelity dataset using advanced acoustic modelling and a hybrid combination of wave-based and geometrical acoustics simulations. Training on the high-fidelity dataset results in an up to 38 % relative reduction in median word error rate compared to the lower-fidelity alternatives. These results show that augmentation with high-fidelity room-acoustic simulations directly translates into improved multichannel speech enhancement performance.
[AI-145] Von Mises Based Uncertainty Quantification for Closely Spaced Automotive Radar Targets
链接: https://arxiv.org/abs/2606.31473
作者: Vinay Kulkarni,V. V. Reddy
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Probability (math.PR)
备注: 12 pages, 5 figures
Abstract:This work investigates uncertainty-aware deep learning approaches for direction of arrival (DOA) estimation in automotive radar, focusing on probabilistic modeling and downstream integration. A circular-statistics-based von Mises (VM) ensemble (ENS) is compared with an evidential deep learning (EDL) framework based on a normal inverse gamma formulation, yielding a Student t predictive distribution in the Euclidean domain. The ENS framework produces angular predictions parameterized by (mu, kappa), enabling interpretable uncertainty aligned with directional geometry. Performance is evaluated under in distribution and multiple out-of-distribution conditions using risk coverage and ROC or AUROC analyses. Results indicate that ENS achieves lower uncertainty under nominal conditions and exhibits stronger sensitivity to severe perturbations, whereas EDL provides smoother uncertainty variation and slightly improved ranking consistency. Importantly, the ENS representation enables direct probabilistic integration into association modules via closed form VM likelihoods, facilitating a unified detection tracking pipeline. These findings highlight a trade-off between geometric consistency and statistical generality in uncertainty-aware DOA estimation.
[AI-146] From Materials Database to Materials Bank: Assetizing Data for AI Driven Materials Innovation
链接: https://arxiv.org/abs/2606.31366
作者: Chenyao Ma,Di Zhang,Weibo Gong,Wei Du,Rui Su,Yuhang Chen,Kan Xu,Huan Gu,Limin Li,Piao Ma,Zhenghao Li,Hao Li
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
备注:
Abstract:Driven by high-throughput experimentation, computational modeling, and artificial intelligence (AI), materials data has expanded at an unprecedented rate. Conventional materials databases function only as passive repositories, archiving raw experimental records indiscriminately including both successful and failed data, without systematic value filtering or asset management. This creates a critical gap between massive data accumulation and actionable innovation, hindering the identification of high-potential materials and industrial translation. To address this bottleneck, we propose an industrialization-oriented Materials Bank, a dedicated valuefiltering and assetization layer that operates beyond traditional databases. It does not merely curate high-quality data but systematically elevates qualified candidates into standardized, upgradable materials assets via a multi-dimensional BankCard framework covering scientific validity, synthesis feasibility, application readiness, and industrial value. By unifying databases, AI models, automated experimentation, and multi-criteria assessment into a cohesive closed-loop ecosystem, the Materials Bank establishes a clear trajectory from data to knowledge, candidate, asset, and product. It serves not as an enhanced database or screening tool, but as a decision infrastructure bridging academic discovery and industrial demand, offering a scalable paradigm to accelerate AI-driven materials innovation and deliver tangible real-world impact.
[AI-147] PGUDA: Pressure-Guided Unsupervised Domain Adaptation with Cross-Modal Knowledge Distillation for sEMG-Based Gesture Recognition
链接: https://arxiv.org/abs/2606.31349
作者: Yurui Liu,Xiao-Cong Zhong,Qisong Wang,Xuefu Wang,Dan Liu,Jinwei Sun
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注:
Abstract:Surface electromyography (sEMG)-based gesture recognition has emerged as a promising technology for natural human-computer interaction. However, its practical deployment remains challenging due to severe performance degradation caused by feature distribution discrepancies across different subjects and recording sessions. Although domain adaptation (DA) techniques are commonly employed to mitigate such discrepancies, conventional methods often struggle to effectively aligning sEMG features, primarily due to their inherent stochasticity and the scarcity of labeled data. To address these limitations, this paper proposes a novel Pressure-Guided Unsupervised Domain Adaptation (PGUDA) framework, which leverages the robustness and stability of pressure signals to introduce a cross-modal knowledge distillation strategy that transfers consistent physical semantics across modalities. Specifically, a teacher network trained on pressure signals guides an sEMG student network on unlabeled target domains, thereby regularizing the representation learning process with transferable and modality-invariant knowledge. Extensive experiments conducted on a self-collected multimodal dataset involving eleven subjects validate the effectiveness of the proposed PGUDA framework. The results demonstrate that our proposed PGUDA achieves leading performance in both cross-subject and cross-session classification tasks, achieving average accuracies of 58.08% and substantially outperforming existing DA approaches. Notably, PGUDA exhibits remarkable label efficiency: it attains classification accuracy comparable to fully supervised benchmarks while requiring only 5% of labeled data for teacher network training. This framework offers a robust and data-efficient solution that can significantly reduce the calibration burden in practical sEMG-based gesture recognition systems.
[AI-148] Minimizing Quantized Semantic Age of Information (QSAoI) in Foundation Model-Based Semantic Communications
链接: https://arxiv.org/abs/2606.31303
作者: Huanyu Zhang,Yulin Hu,Xiaopeng Yuan,Aydin Sezgin,Anke Schmeink
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Accepted to IEEE SPAWC 2026
Abstract:The emerging techniques of semantic communications and edge computing in 6G networks necessitate a paradigm shift toward co-designed semantic-aware and adaptive resource allocation for short-packet transmissions. However, there is a fundamental gap between the semantic layer and the physical layer under low-latency finite blocklength (FBL) effects. To bridge this gap, we introduce the Quantized Semantic Age of Information (QSAoI), a novel metric that rigorously captures the trade-offs among freshness and semantic efficiency of high-level features in real-time communication in the FBL regime. Guided by this metric, we propose a novel foundation model-based efficient co-designed framework to minimize the expected QSAoI over wireless fading channels in latency-constrained semantic communication. Specifically, we formulate a non-linear joint optimization problem to dynamically optimize the block-wise mixed-precision quantization (MPQ) strategy and the physical blocklength. To efficiently resolve this complex problem, we develop a high-efficiency low-complexity algorithm based on fixpoint inspection and bisection search. Extensive simulations validate that our proposed algorithm dynamically adapts the semantic quantization precision to varying channel conditions, effectively minimizing the expected QSAoI compared to baselines.
[AI-149] Detecting Audio Deepfakes on the Edge:Lightweight SSL-Based Detection in a Browser Plugin
链接: https://arxiv.org/abs/2606.30780
作者: Octavian Pascu,Dan Oneata,Horia Cucu,Nicolas M. Muller
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:
Abstract:Audio deepfakes are a growing challenge for the general public, as well as for journalists and fact-checkers. The latter need reliable tools to verify the authenticity of their sources, while at the same time keeping their information private. Commercial deepfake detection solutions rely on cloud-based processing, which raises privacy concerns. To solve this problem, we propose an on-device audio deepfake detection model. We show that a truncated self-supervised backbone with a simple logistic classifier is both very fast and often more accurate than existing solutions. Our solution outperforms the baseline AASIST by 10% and improves inference speed by 40%. We integrate this model into a browser plug-in, which allows journalists and fact-checkers to detect deepfakes easily and securely. Code for the plugin is available at this https URL.
[AI-150] Modeling Cell-Cycle-Aware Single-Cell Drug Perturbation Responses
链接: https://arxiv.org/abs/2606.30695
作者: Dingping Zhao,Jie Lin
类目: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
备注:
Abstract:Single-cell drug perturbation models should predict not only transcriptional response magnitude, but also whether a treatment alters the proliferative state of a cell. This is challenging because cell-cycle variation is often treated as nuisance variation, and benchmark pipelines rarely treat drug-induced phase changes as a primary prediction target. We introduce scCycleMol, a cell-cycle-aware perturbation prediction framework built on a curated 24-hour SciPlex3 benchmark with standardized molecule identities, dose and cell-line metadata, and gene expression with cell-cycle supervision derived from treated states. Instead of using cell-cycle state as an input covariate, scCycleMol derives supervision from predicted treated expression and propagates it through a learnable full-expression cell-cycle head with circular G1/S/G2M phase targets. We evaluate marker-based supervision, molecular representations, and pretraining strategies to isolate sources of improvement. Across a SciPlex3 benchmark with over 600k cells, 186 perturbation conditions, multiple cancer cell lines, and thousands of genes, scCycleMol improves out-of-distribution expression prediction compared with conditional perturbation baselines. The best LINCS-pretrained circular model achieves 0.9093 expected all-gene r squared and 0.6843 expected differentially expressed gene r squared, compared with 0.6800 and 0.5400 for LINCS-pretrained ChemCPA. Closed-loop cell-cycle supervision improves phase accuracy by about 0.5 to 0.6 points while maintaining nearly unchanged expression prediction. A Tahoe-pretrained variant reaches 0.9609 phase accuracy, highlighting the benefit of explicit cell-cycle-aware supervision in perturbation modeling.
[AI-151] A Coherence Law for Trainability in Noisy Equivariant Quantum Neural Networks
链接: https://arxiv.org/abs/2606.30688
作者: Hassan Ugail,Newton Howard
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph)
备注:
Abstract:Symmetry provides a quantum neural network structure, but on its own it does not keep the network trainable once noise is present. We ask which physical quantity decides whether the gradients of an equivariant circuit survive decoherence, and we answer with a compact training law. Working with U(1)-equivariant brickwork circuits that conserve a charge, we find that two distinct effects govern a trainable gradient. Causality fixes where the gradient can live, confining it to the backward light cone of the readout inside the active charge sector. Coherence then determines how fast it decays through the contraction of the off-diagonal sector modes that the projected readout can actually observe. We prove a light-cone reduction that pins the noiseless gradient to the sector-restricted cone with a lower bound independent of the total qubit number, and we define a readout-visible aligned coherence rate as a Rayleigh quotient of the noise generator along the gradient-carrying mode. A perturbative open-system analysis turns this rate into a leading-order training law. Density-matrix simulations then confirm that the finite-noise degradation follows a single accumulated variable built from noise depth and coherence contraction, with a coefficient of determination of 0.979. The sharpest test comes from a correlated-dephasing channel that has a large worst-case rate but a near-zero aligned rate. The law predicts no gradient loss for this channel, and none is seen. Sector coherence outperforms every standard channel diagnostic we compare it against, and the analysis identifies readout-visible sector coherence as the quantity that links equivariant architecture, open-system dynamics and noisy trainability.
[AI-152] Unsupervised Thermodynamics of Molecular Diffusion Models: Action-Operator Semantics and Auditable Free-Energy Readout
链接: https://arxiv.org/abs/2606.30687
作者: Wenjie Xi
类目: Chemical Physics (physics.chem-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models are increasingly utilized for modeling molecular structures and conformational ensembles, yet the thermodynamic meaning of their learned representations and scores remains elusive. To resolve this ambiguity, we introduce a mathematically consistent action-operator framework natively compatible with diffusion models. By defining a fixed molecular environment as a base action S_0(x) and an alchemical perturbation as an operator O(x) , standard diffusion noising induces effective noised actions and operators whose gradients and alchemical derivatives are directly represented by the model’s learned fields. This rigorous self-consistency enables a ``noisy operator bridge’’ capable of reading out free-energy differences ( \Delta F ) from endpoint ensembles and per-frame evaluations. In controlled experiments on alanine dipeptide systems, we show that incorporating physical inductive biases enables partial recovery of the base action and perturbation operator. When applied to a challenging C6-H to C6-F ligand-pocket nonbonded perturbation (185L/IND) with negligible phase-space overlap, our supervised bridge estimates the alchemical \Delta F within approximately 1\ k_\mathrmBT of a stable 19-state MBAR reference. Finally, we demonstrate that endpoint coordinates and binary labels alone are sufficient to partially recover the operator shape and a centered free-energy scale without any force or action supervision. This work provides a rigorous path toward transforming generative molecular diffusion models from black-box coordinate samplers into auditable thermodynamic estimators.
[AI-153] Listening Between the Lines: Joint Learning of ASR Embeddings and LLM -Augmented Linguistics for Dementia Detection INTERSPEECH2026
链接: https://arxiv.org/abs/2606.30675
作者: Olivier Jiyoun Jung,Jonghyeon Park,Myungwoo Oh
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: Accepted at INTERSPEECH 2026
Abstract:Early detection of dementia through speech analysis offers a non-invasive screening alternative, but capturing both acoustic and linguistic biomarkers remains challenging. We propose a multimodal framework leveraging Whisper for dual-purpose extraction: acoustic representations from encoder outputs and transcripts via automatic speech recognition (ASR). For the acoustic pathway, temporal networks with attention pooling aggregate variable-length sequences into fixed-dimensional embeddings. For the linguistic pathway, we prompt a large language model (LLM) to extract interpretable features spanning lexical diversity, syntactic complexity, semantic coherence, and discourse patterns. A gated fusion network integrates both modalities. On ADReSS and ADReSSo, our method achieves F1-scores of 89.47% and 90.14%, demonstrating effective integration of acoustic and LLM-augmented linguistic features. Ablation shows that multimodal fusion consistently outperforms either modality alone.
[AI-154] Estimating the Effect of Timing on Coupon Effectiveness KDD2022
链接: https://arxiv.org/abs/2606.30664
作者: Deddy Jobson
类目: Applications (stat.AP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 12 pages, 5 figures. Published in Proceedings of the 1st Workshop on End-End Customer Journey Optimization, co-located with KDD 2022, August 15, 2022, Washington, DC
Abstract:The coupon incentive is one of the most common tools marketers use to court users to engage with a business at various stages of the customer life cycle. A variety of factors can affect the effectiveness of a coupon incentive on users, timing being one of them. We hypothesize that coupons can be more effective when delivered at critical times in the customer journey, right when a user is engaging with the platform. Verifying such a hypothesis would typically require real time event-triggered coupon distribution software that may be too expensive to implement. In this paper, we propose a framework in which we apply causal inference on “natural randomized control trial experiments” to measure the effectiveness of sending coupons at the right time to users without requiring a dedicated AB test. We demonstrate the usefulness of our framework in the case of a user onboarding coupon campaign held in our company and show how the results can lead to correct data-driven decisions for the business. Furthermore, in order to test the generalizability of our framework, and to make our research more reproducible, we apply our framework on a user retention campaign with a publicly available dataset.
[AI-155] Surrogate-Gated Generation and Foundation-Model Embeddings for Bayesian Materials Design
链接: https://arxiv.org/abs/2606.28578
作者: Sk Md Ahnaf Akif Alvi,Jan Janssen,Danny Perez,Douglas Allaire,Raymundo Arroyave
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注:
Abstract:Closed-loop materials discovery iterates between proposing candidate structures and evaluating their properties, and property evaluation dominates the cost. In the generative variant, a learned prior proposes candidate crystals and a property oracle scores them; we ask whether a cheap probabilistic surrogate can triage the generator’s output, and what such a surrogate must do well. Across three architecturally distinct pretrained diffusion priors (MatterGen, CrystalFlow, ADiT) and two targets (room-temperature heat capacity and bulk modulus), we insert a Gaussian process acquisition gate between structure generation and the oracle in an RL-steered generative workflow. The gate matches or exceeds ungated fine-tuning of the generative model while capping oracle calls at a fixed per-cycle budget. Budget-matched ablations isolate the mechanism. At an identical four-call budget, ranking-based selection outperforms arbitrary selection, confirming that the gain comes from the surrogate’s choice; the gate comes within \sim 9% of exhaustive oracle spending at roughly one-fifth of the calls. A density-functional-theory check of the bulk-modulus discoveries confirms the learned oracle to within 2.5% on average and the surrogate’s ranking of the generated structures at Spearman \rho = 0.94 . A cross-factorial benchmark of surrogate performance spanning mechanical, electronic, and vibrational properties identifies pretrained ORB embeddings with a Gaussian process as the most reliable combination, which we adopt as the building blocks of the proposed workflow. The complete pipeline is released as open-source software.
机器学习
[LG-0] FedLAB: Traceable Semantic Codebooks for Federated Multimodal Graph Foundation Learning
链接: https://arxiv.org/abs/2606.32016
作者: Zekai Chen,Kairui Yang,Xuaner Chen,Xunkai Li,Xun Wu,Rong-Hua Li,Guoren Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multimodal graph foundation models aim to learn reusable knowledge from graphs enriched with text, images, attributes, and relational topology, thereby supporting diverse graph-centric and modality-centric tasks. In practice, however, such multimodal graphs are often distributed across decentralized clients, where raw contents and local structures cannot be centrally shared due to privacy constraints. This motivates federated multimodal graph foundation learning, which requires not only transferable representation learning but also intrinsic semantic traceability under strict data isolation. Existing methods usually exchange or store knowledge through parameters, prototypes, embeddings, or compact codebooks, which support optimization and transfer but do not explicitly expose how modality evidence, node semantics, and topology context jointly support predictions. To bridge this gap, we propose FedLAB, a traceable semantic codebook framework that organizes multimodal graph knowledge into typed hierarchical codebooks for modality evidence, node semantics, and topology context. FedLAB further refines these trace units through federated semantic barycenter pre-training while keeping raw multimodal contents and graph structures local. Extensive experiments on 10 benchmarks and 6 downstream tasks show that FedLAB improves over state-of-the-art baselines by up to 7.53%, while preserving a native semantic trace interface.
[LG-1] Surrogate Fidelity: When Can Open LLM s Explain Closed Ones?
链接: https://arxiv.org/abs/2606.32008
作者: Philippe Chlenski,Zachariah Carmichael,Ayush Warikoo,Chia-Tse Shao,Yingxiao Ye,Aobo Yang,Vivek Miglani,Nehal Bandi
类目: Machine Learning (cs.LG)
*备注:
Abstract:Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model’s representation space, and leave-one-out attributions provide insight into model behavior. Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is insufficient to warrant such transfer. Code and results are available at this https URL.
[LG-2] Evaluation of Population Initialization Methods for Genetic Programming-based Symbolic Regression
链接: https://arxiv.org/abs/2606.31990
作者: Lukas Kammerer,Gabriel Kronberger,Deaglan J. Bartlett,Harry Desmond,Pedro G. Ferreira,Stephan Winkler
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 15 pages. Accepted for publication at EUROCAST 2026: 20th International Conference on Computer Aided Systems Theory
Abstract:We analyze the effect of optimizing the initial population of genetic programming (GP) for symbolic regression (SR) on the accuracy and complexity of solutions. We compare three well-established random initialization methods as well as initialization with small optimized solutions from exhaustive symbolic regression (ESR) using a GP/SR implementation which is based on the multi-objective evolutionary algorithm NSGA-II. We compare the final Pareto fronts found with each initialization method on twelve synthetic problems of varying complexity and one real-world dataset. We find no significant differences in accuracy or model complexity among the initialization methods. The initial advantage of initialization with ESR disappears after only a few generations. Our results show that, given similar diversity in the initial population, the effect of the initialization method in GP-based symbolic regression on the final Pareto front is negligible.
[LG-3] Semantic Leakage and Privacy Preservation in Relay-Assisted Semantic Communications
链接: https://arxiv.org/abs/2606.31973
作者: Yalin E. Sagduyu,Tugba Erpek,Aylin Yener,Sennur Ulukus
类目: Networking and Internet Architecture (cs.NI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Semantic communication (SemCom) has emerged as a promising paradigm in which the transmission of task-relevant information is prioritized over raw data, enabling efficient and robust communication under resource and channel constraints. In this paper, the privacy implications of relay-assisted SemCom systems are studied, where the intermediate relay node operates directly on learned latent representations. It is shown that the relay, even without access to source data, can reliably infer semantic meaning and reconstruct signals with performance comparable to that of the legitimate receiver, revealing a fundamental privacy vulnerability of semantic representations. To address this issue, an iterative adversarial training framework is proposed in which a strong, adaptively trained eavesdropper at the relay is explicitly accounted for. The proposed approach alternates between optimizing the relay’s eavesdropping function and the legitimate system, resulting in representations that preserve semantic decoding performance at the intended receiver while degrading semantic inference at the relay. The semantic accuracy gap between the legitimate receiver and the eavesdropper is significantly enlarged across channel conditions. Importantly, this protection is achieved in a stealthy manner, with high reconstruction fidelity maintained while semantic leakage is selectively suppressed.
[LG-4] Making Sense of Touch from the Childs View for Contrastive Learning
链接: https://arxiv.org/abs/2606.31943
作者: Max Whitton,Zecheng Wang,Puchen Liu,Quang Tuan Truong,Shengao Wang,Manaswi Yadamreddy,Oktay Ozel,Visista Jayanti,Saniya Sekhon,Hanna Samuel Tadesse,Lawrence Miao,Junjie Wang,Jiasen Lu,Chen Yu,Boqing Gong
类目: Machine Learning (cs.LG)
*备注:
Abstract:Is the sense of touch a mechanism for human babies’ learning of visual concepts? If so, can we quantify its importance, and to what extent do babies rely on their sense of touch for visual learning? To approach these questions in a principled way, we propose a structured coding system for baby-centric touch events, yielding a dataset of 264k two-second clips of touch events coded according to this system. Using this dataset, we pretrain developmentally grounded models that reveal promising insights into the nature of baby learning from touch.
[LG-5] Interface-Aware Neural Newton Preconditioning for Robust Cohesive Zone Model Simulations
链接: https://arxiv.org/abs/2606.31921
作者: Zhangyong Liang,Huanhuan Gao
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*备注:
Abstract:Cohesive Zone Models (CZMs) are widely used to simulate interface fracture, delamination, adhesive failure, and fiber–matrix debonding in aerospace composite structures. In implicit quasi-static finite element analyses, cohesive softening may introduce negative interface tangents, solution jumps, and Newton-basin mismatch, so the previous converged state can become a poor initial guess for the next increment. This may lead to stagnation, wrong-branch convergence, or repeated step cuts. Existing remedies, including viscous regularization, path following, dynamic relaxation, and manual Newton–Raphson (NR) modification, either alter the effective response, increase cost, or rely on hand-crafted interface rules. This work proposes an Interface-Aware Neural Newton Preconditioner (IA-NNP) for difficult CZM increments. IA-NNP recasts manual NR modification as rule-based interface lifting and generalizes it into a learned, state-dependent interface correction. The method acts only on active interface variables and preserves the original traction–separation law, residual assembly, tangent evaluation, history update, and dissipation checks. Two realizations are developed: IA-NNP-Init for learned initial-guess lifting and IA-NNP-NL for iteration-level nonlinear right preconditioning. Interface graph features encode opening, traction, tangent, damage/history variables, mode mixity, residuals, and neighboring states. The correction is bounded, confidence-gated, and accepted only through the original CZM Newton solve. A root-equivalence property shows that IA-NNP changes the path to convergence but not the discrete CZM solution set. Tests on horizontal, circular, two-interface, and active-front benchmarks show improved difficult-increment convergence, better branch recovery, and fewer failures than standard NR and manual NR modification, while preserving the force–displacement response.
[LG-6] Sequential RC-TGAN: Generating Relational Time Series with Spectral Envelope Loss
链接: https://arxiv.org/abs/2606.31904
作者: Mohamed Gueye,Yazid Attabi,Manuel Morales,Maxime Dumas
类目: Machine Learning (cs.LG)
*备注:
Abstract:The generation of synthetic relational databases often involves modeling complex temporal dynamics, such as transaction logs or event sequences. A significant challenge in this domain is the handling of categorical time series (e.g., status codes), where standard encoding methods like one-hot encoding fail to capture intrinsic frequency-domain features such as seasonality and cyclicity. In this paper, we introduce Sequential RC-TGAN (Seq. RC-TGAN), a temporal extension of the RC-TGAN framework, equipped with a novel integrated loss function based on the \textitSpectral Envelope Theory. This differentiable loss allows the generator to directly optimize the preservation of latent periodic structures via backpropagation. While spectral envelope theory is inherently designed for categorical sequences, we extend this frequency-domain regularization to continuous time series by employing a Variational Gaussian Mixture Model (VGM) discretization strategy. To establish a mathematically rigorous evaluation standard, we simulate categorical time series governed by a parameter \alpha , with exactly known theoretical spectral envelopes. Integrating these dynamic sequences into the child tables of a relational database yields a robust ground-truth benchmark for evaluating the frequency-domain fidelity of our generative framework. Furthermore, we address the lack of robust evaluation standards for relational time series by proposing two new metrics: Spectral Density Divergence and Spectral Envelope Divergence. Experimental results on real-world datasets, as well as our simulated benchmarks, demonstrate that our end-to-end approach significantly outperforms state-of-the-art systems in reproducing cyclic patterns and long-term seasonality across both categorical and continuous features.
[LG-7] Low-dimensional topology of deep neural networks ICML2026
链接: https://arxiv.org/abs/2606.31856
作者: Junyu Ren,Lek-Heng Lim
类目: Machine Learning (cs.LG); Geometric Topology (math.GT)
*备注: Accepted at ICML 2026
Abstract:We study layered models, including feedforward networks, ResNets, and transformers, by limiting each layer to a width of d = 3 , i.e., \mathbbR^3 as representation space. This allows us to track how a neural network changes low-dimensional topological invariants through its layers. Just about any topological structure may be simplified or even trivialized by simply increasing dimension; e.g., any knot is equivalent to an unknot in \mathbbR^4 . By restricting to \mathbbR^3 , we not only isolate the effects of activation and depth from that of width, we work in a space that lends itself to easy visualization. We focus on linking number here, deferring other invariants like link groups, Milnor’s \bar\mu -invariants, knot types, ambient cobordisms, to a sequel. We provide full proofs and empirical experiments to justify the following insights: When measured by their power to effect changes in linking numbers, the layer-skipping feature in ResNets is as powerful as the attention mechanism in transformers; both ResNets and transformers are strictly more powerful than feedforward neural networks with monotonic activations, which are in turn more powerful than invertible and flow-based models; but replacing monotonic activation with a nonmonotonic one elevates a feedforward network into the same expressivity class as ResNets and transformers. These results suggest that low-dimensional topology can be a useful tool to guide designs of AI architectures. We also generalize our results from d = 3 to arbitrary d 3 .
[LG-8] Relational and Sequential Conformal Inference for Energy Time Series over Graphs via Foundation Models
链接: https://arxiv.org/abs/2606.31804
作者: Keivan Faghih Niresi,Alice Cicirello,Olga Fink
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Under-review
Abstract:Accurate energy demand forecasting is essential for the reliable operation and planning of modern sustainable energy systems. Spatial-temporal graph neural networks (STGNNs) have recently achieved strong performance in point forecasting by jointly modeling temporal dynamics and relational dependencies across interconnected energy nodes. However, in real-world energy systems, accurate point forecasts alone are insufficient, as operators also require reliable uncertainty estimates to support risk-aware decision-making, grid stability, and operational planning under uncertainty. Conformal prediction provides a principled and model-agnostic framework for uncertainty quantification with statistical coverage guarantees, making it particularly attractive for safety-critical energy applications. However, existing conformal prediction approaches often fail to fully capture the complex spatial-temporal structure of energy systems. To address these limitations, we propose STOIC (Spatial-Temporal Graph Conformal Prediction with In-Context Learning), a novel framework that integrates graph-based forecasting with the zero-shot calibration capabilities of tabular foundation models. STOIC first generates point forecasts using an STGNN and subsequently reformulates spatial-temporal residuals into a tabular representation suitable for in-context learning. Leveraging a tabular foundation model, STOIC calibrates prediction intervals without task-specific retraining, effectively capturing both sequential and relational dependencies. We evaluate STOIC on five diverse benchmarks, including synthetic simulations as well as real-world electricity and district heating networks. Across all datasets, STOIC consistently outperforms existing conformal prediction baselines, delivering more reliable and robust uncertainty estimates for complex graph-structured energy time series.
[LG-9] Policy Optimization Achieves Data-Dependent Regret Bounds in MDPs with Unknown Transitions
链接: https://arxiv.org/abs/2606.31769
作者: Mingyi Li,Taira Tsuchiya,Kenji Yamanishi
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 70 pages, 2 tables
Abstract:We study policy optimization for online episodic tabular Markov decision processes with unknown transition kernels, aiming for best-of-both-worlds guarantees together with data-dependent regret bounds. Recent work (Dann et al., 2023; Li et al., 2026) has shown that policy optimization can adapt to both adversarial and stochastic losses with first-order, second-order, and path-length bounds, but only under known transitions, leaving open whether such data-dependent guarantees are achievable by policy optimization when the transition kernel is unknown. We resolve this by developing a new algorithm based on optimistic follow-the-regularized-leader that attains these guarantees under unknown transitions. The key ingredient is a new design of optimistic Q -function estimators together with a data-dependent transition bonus that controls estimator bias through the loss-prediction error. Our analysis further identifies an unavoidable transition-dependent complexity term that captures the intrinsic cost of estimating the transition kernel. As a result, we obtain first-order, second-order, and path-length bounds with the transition-dependent complexity term while simultaneously achieving gap-dependent \mathrmpolylog(T) regret in the stochastic regime.
[LG-10] Addressing Over-Refusal in LLM s with Competing Rewards
链接: https://arxiv.org/abs/2606.31748
作者: Taeyoun Kim,Aviral Kumar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Safety training on language models often induces over-refusal: improved safety on harmful prompts at the cost of increased refusal on harmless ones. Though this trade-off can be mitigated by training models with reinforcement learning (RL) to reason before answering, it does not remove the underlying problem that reasoning can often be a “rubber stamp” for a predetermined response. In this paper, we address the safety-refusal trade-off by rethinking how models are trained to reason about safety. Our key insight is that unsafe reasoning can itself serve as a useful exploratory signal. Rather than preemptively blocking harmful thoughts, we encourage the model to sufficiently explore unsafe reasoning but produce a safe response. The harmful exploration improves the model’s ability to distinguish harmful from harmless prompts by resolving ambiguity, allowing it to remain safe while complying only when appropriate. We cast this as an adversarial optimization problem in which a reasoning player explores strategies for producing an unsafe response and an answer player ensures that the final output is safe. We train a single model with dense rewards to play both roles within one chain-of-thought, across different segments. To achieve this, we find that process rewards are crucial for stable optimization of competing objectives. Our resulting model SEAR deliberately engages in harmful reasoning as exploration while reliably flipping back to a safe answer. We demonstrate that this behavior helps mitigate over-refusal and defend against attacks that directly manipulate the reasoning to be harmful.
[LG-11] Nonlinearity-Aware LoRA: Structured Gate Adaptation under Low-Rank Constraints
链接: https://arxiv.org/abs/2606.31717
作者: Shuai Yuan,Sudong Cai,Bingzhi Chen,Shuyuan Zheng,Chuan Xiao,Makoto Onizuka,Rui Mao
类目: Machine Learning (cs.LG)
*备注: 19 pages, 4 figures, 5 tables. Under review
Abstract:Low-rank adaptation (LoRA) is commonly viewed as an update-space approximation to full fine-tuning, yet this view is incomplete for self-gated Transformer feed-forward networks. In gated FFNs, a low-rank residual can change not only projected features but also the nonlinear selection weights that determine which channels contribute to the output. We formalize this effect as selection misalignment and connect it to the local effective homogeneity of self-gated activations. This motivates a nonlinearity-aware principle for parameter-efficient fine-tuning: low-rank updates should allocate capacity to gate channels whose nonlinear states remain responsive and should shape the temporal evolution of selection. We propose NA-LoRA, a training-only method with two lightweight mechanisms: a derivative-based temporal-importance mask for gate-related LoRA updates and an activation-specific step-scaling rule when a meaningful coarse effective-homogeneity partition is available. NA-LoRA adds no auxiliary loss and incurs no inference-time overhead. Experiments on language-model fine-tuning and vision-language transfer benchmarks show that NA-LoRA consistently improves over vanilla LoRA and is competitive with or better than strong PEFT variants.
[LG-12] Diffusing Blame: Task-Dependent Credit Assignment in Biologically Plausible Dual-Stream Networks
链接: https://arxiv.org/abs/2606.31700
作者: Yutaro Yamada,Luca Grillotti,Rujikorn Charakorn,Sebastian Risi,David Ha,Robert Tjarko Lange
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: ALIFE2026
Abstract:Biological neural circuits obey Dale’s principle: each neuron’s synapses are uniformly excitatory or inhibitory. Artificial networks that respect this constraint must coordinate separate excitatory and inhibitory populations, fundamentally changing how credit is assigned during learning. Several biologically plausible learning rules avoid backpropagation’s weight transport requirement, but it has been difficult to achieve strong performance under Dale’s principle beyond MNIST. Error Diffusion (ED) was originally proposed in a dual-stream excitatory/inhibitory architecture, where learning is driven by routing global error signals to all layers without transporting transposed forward weights or relying on random feedback matrices. Whether such a rule can scale under Dale’s principle across both supervised classification and reinforcement learning remains unknown. Here, we introduce modulo error routing to extend Error Diffusion beyond binary classification, and show that a dual-stream excitatory/inhibitory architecture trained with this method achieves 96.7% on MNIST and establishes a 61.7% baseline on CIFAR-10, demonstrating that representation learning is possible even when strictly enforcing Dale’s principle. For the classification setting, we introduce three domain-specific innovations: layer-specific sigmoid widths, batch-centered class error signals, and asymmetric initialization, and ablation analysis reveals that their relative importance reverses between MNIST and CIFAR-10, exposing task-dependent credit-assignment bottlenecks invisible to single-benchmark evaluation. In reinforcement learning, we integrate ED with Proximal Policy Optimization (PPO) and evaluate it on continuous-control tasks in Google Brax and on Craftax, an open-ended exploration task. We show that ED-PPO achieves competitive performance relative to Direct Feedback Alignment, a backpropagation-free baseline.
[LG-13] Calibration Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models
链接: https://arxiv.org/abs/2606.31630
作者: Jian Xu,Delu Zeng,John Paisley,Qibin Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Language models increasingly write probabilistic programs (in NumPyro, Stan, or Pyro), but a program that compiles, runs, and passes every unit test can still be \emphstatistically wrong – a Gaussian likelihood for heavy-tailed data, a Poisson for over-dispersed counts, an invalid prior support, or a pathological parameterization. The right verifier is therefore not a test suite but the Bayesian workflow itself: posterior predictive checks, simulation-based calibration, sampler diagnostics ( \hat R , divergences, ESS), and held-out predictive density. We study this calibration oracle along three axes. \textbfDetection: on a benchmark of 14 misspecification types across 10 model families ( 200 instances), it flags the bug with AUC 0.97 ( 88% at 2% FPR \emphwhen handed the correct reference program, an upper bound) – and a fully \emphreference-free version that uses no correct program reaches 62 – 78% (the upper figure from a small automated model search), versus 0% for a unit-test oracle. \textbfRepair: used as feedback in an LLM repair loop across fifteen models, calibration significantly outperforms unit-test feedback – which is itself \emphsignificantly worse than no feedback at all, a passing test inducing false confidence that suppresses repair – and improves over no feedback on strong-but-unsaturated models (GPT-5.1 33\to92% , Claude 75\to100% ; paired McNemar, n=228 ). \textbfReality: on programs LLMs write from scratch for neutral briefs, 15 – 47% of runnable ones are statistically misspecified (unit tests catch none), and calibration-guided repair significantly beats LLM-as-judge review, a Bayesian-workflow checklist, and data-summary self-debug. Across all three, the lesson is the same: for probabilistic programs, correctness is calibration, not compilation.
[LG-14] From Failure to Alignment: A Requirements Engineering Framework for Machine Learning Systems
链接: https://arxiv.org/abs/2606.31589
作者: Amel Bennaceur,Gopi Krishnan Rajbahadur,Prince Mercy,Bashar Nuseibeh,Faeq Alrimawi
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注: 12 pages
Abstract:Organisations designing, developing, and deploying machine learning systems (MLS) need to be able to check that these systems are trustworthy, and communicate this clearly to their stakeholders, be they different categories of users, engineers, or wider society. By focusing on stakeholders, Requirements Engineering is well positioned to drive the design and engineering of MLS that align with the needs of their stakeholders. Yet, we still need a systematic process for modelling and reasoning about requirements for MLS that is driven both by stakeholders’ needs and constraints for MLS development. This paper proposes a framework entitled REAL (Requirements Engineering for mAchines that Learn - and Fail) to help develop MLS that align with stakeholders’ needs by adopting a requirements engineering approach. This model-based framework is based on three principles. First, weaving together requirements for data, models, and the system as a whole. Second, using failure to drive the exploration of alternative requirements. Third, iterative and traceable refinement of MLS requirements. We demonstrate the proposed framework using an example from autonomous driving and show that REAL supports the development of MLS that better align with stakeholders’ requirements. A replication package is available online. Comments: 12 pages Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG) Cite as: arXiv:2606.31589 [cs.SE] (or arXiv:2606.31589v1 [cs.SE] for this version) https://doi.org/10.48550/arXiv.2606.31589 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: RE 2026 - the 34th IEEE International Requirements Engineering Conference
[LG-15] Robustness of neural networks to random noise perturbations of their inputs
链接: https://arxiv.org/abs/2606.31581
作者: Mark Levene,Martyn Harris
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注:
Abstract:We investigate the problem of the robustness of a trained neural network to the perturbation of its input values. More specifically, we examine the interplay between the accuracy of the network, as measured by the mean squared error, and robustness. Accordingly, we present a robustness measure, which, with high probability, suggests an upper bound on the mean squared error of the network, with respect to an input data set, for a given perturbation of the input values of the network. The measure we propose is both simple and efficient to compute, treating the neural network as a black box. We provide experimental results on several real-world data sets showing the efficacy of the proposed method. We also introduce the concept of robustness curves, which allows us to further analyse robustness within and between data sets.
[LG-16] Introduction to Stochastic Differential Equations for Generative Machine Learning: A Variational Perspective
链接: https://arxiv.org/abs/2606.31576
作者: Ole Winther,Paul Jeha,Sander Dieleman,Andriy Mnih,Manfred Opper,Andrea Dittadi
类目: Machine Learning (cs.LG)
*备注:
Abstract:The use of ordinary and stochastic differential equations has led to substantial progress in generative machine learning with applications to, for example, image, video and biomolecule generation. This paper provides a self-contained and informal introduction to the differential equations, the probabilistic framework for using them in generative modeling and the Fokker–Planck equation that governs the temporal evolution of the marginal distribution of the stochastic variables of the differential equations. The variational lower bound on the log-likelihood (the evidence lower bound, ELBO) is derived and used as a general starting point for a discussion of diffusion models, score matching, and flow matching. All of these approaches may be viewed as specific parameterizations of the most general variational approach. A one-dimensional density modeling problem is used as a simple example to compare different parameterizations.
[LG-17] Beyond the Expressivity-Trainability Paradox: A Dynamical Lie Algebra Perspective on Navigating Barren Plateaus in Quantum Machine Learning
链接: https://arxiv.org/abs/2606.31536
作者: Kung-Ming Lan
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注: 8 pages, 3 figures
Abstract:As Quantum Machine Learning (QML) transitions toward practical implementation, the field faces a critical architectural bottleneck that challenges the fundamental assumptions of classical statistical learning theory. In classical deep learning, increasing model capacity typically risks overfitting. However, this study advances a counter-intuitive paradigm: unstructured contemporary QML architectures suffer from a profound state of quantum underfitting, driven by the “expressivity-trainability paradox.” We demonstrate that the vast Hilbert space capacity of Parameterized Quantum Circuits (PQCs)-traditionally chased as the source of quantum advantage is the direct mathematical cause of Barren Plateaus (BPs), where gradient landscapes become exponentially flat. By synthesizing recent breakthroughs in Dynamical Lie Algebras (DLAs) and Geometric QML, we establish a comprehensive framework linking the algebraic dimension of circuit generators to their optimization dynamics. Furthermore, we empirically validate this framework on a non-linear binary classification task, illuminating a uniquely quantum manifestation of the bias-variance tradeoff: while unstructured architectures achieve near-perfect training accuracy via unscalable parameterization (quantum overfitting), embedding group-theoretic geometric priors acts as a structural regularizer. By restricting the DLA growth to a polynomial regime, our symmetry-preserving approach sacrifices raw memorization capacity to guarantee scalable, gradient-rich training landscapes, offering a robust roadmap for “Trainability-by-Design” in scalable quantum neural networks.
[LG-18] Constrained Online Convex Optimization without Slaters Condition
链接: https://arxiv.org/abs/2606.31480
作者: Kihyun Yu,Junehee Lee,Dabeen Lee
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:We study constrained online convex optimization with adversarial losses and stochastic or adversarial constraints. For stochastic constraints, existing algorithms that achieve nearly optimal regret and constraint violation bounds typically rely on regularity assumptions such as Slater’s condition, while adversarial-constraint algorithms avoid these assumptions by using a rather restrictive round-wise feasible comparator. We bridge this gap with an anytime primal-dual framework that incorporates an adaptive regularizer into the dual update. The regularizer stabilizes the dual process without relying on the negative drift induced by Slater’s condition. For stochastic constraints and convex losses, our algorithm achieves O(\sqrtT) expected regret and O(\sqrtT\log T) expected cumulative constraint violation. Furthermore, we show that our algorithm also admits high-probability bounds of the same order on regret and constraint violation. For strongly convex losses, the regret bound improves to O(\log T) with a violation bound of the same order. With a minor modification, the framework also applies to adversarial constraints and provides guarantees for hard constraint violation.
[LG-19] abPATE: Differentially Private Tabular In-Context Learning Without Public Data ICML
链接: https://arxiv.org/abs/2606.31474
作者: Dariush Wahdany,Matthew Jagielski,Jesse C. Cresswell,Adam Dziedzic,Franziska Boenisch
类目: Machine Learning (cs.LG)
*备注: Presented at the 2nd ICML Workshop on Foundation Models for Structured Data (2026)
Abstract:Tabular foundation models enable accurate in-context learning (ICL) from small labeled datasets, but the private records placed in context can leak through model predictions. We first show that even basic membership inference attacks succeed against tabular ICL, motivating formal privacy protection. We then introduce TabPATE, a differentially private PATE-style defense for tabular ICL that does not require public in-distribution data. TabPATE partitions the private context across teacher models, privately aggregates their labels on synthetic tabular queries, and releases the resulting labeled queries as a student context. Because tabular features are bounded and relatively low-dimensional, useful queries can be generated from feature ranges alone or from lightly privatized marginals. Across tabular benchmarks, TabPATE preserves competitive utility while reducing membership inference to near-random success, providing a practical path to private tabular ICL without public data.
[LG-20] Zero-Shot Quantization for Object Detectors using Off-the-Shelf Generative Models ECCV2026
链接: https://arxiv.org/abs/2606.31456
作者: Hyunho Lee,Kyomin Hwang,Hyeonjin Kim,Suyoung Kim,Sunghyun Wee,Nojun Kwak
类目: Machine Learning (cs.LG)
*备注: Published at ECCV 2026
Abstract:With an increasing number of Object Detection (OD) models being deployed on edge devices, Zero-Shot Quantization for OD (ZSQ-OD) aims to quantize these models when access to the original training data is prohibited. Existing research on Zero-Shot Quantization-Aware Training (QAT) for OD synthesizes training sets through noise optimization. However, this approach struggles to maintain performance in low-bit regions. In this paper, we introduce GoodQ (Generative off-the-shelf models for object detector Quantization), a QAT pipeline that utilizes off-the-shelf generative models to construct a training set. We first identify three challenges that arise when introducing a generative model to the ZSQ-OD task: 1) each image contains dense information with multiple instances, 2) the class-wise distribution in the original dataset is imbalanced, and 3) the pseudo-labels assigned to the generated images can potentially act as noisy signals during QAT. GoodQ addresses these challenges by 1) introducing an Information-Dense Prompting strategy to generate multi-instance images, 2) applying Intrinsic Distribution-Aware Selection to match the pretrained class distribution, and 3) employing Teacher-guided Adaptive Noise Reduction to mitigate noise arising from the QAT process. Our framework achieves state-of-the-art performance in low-bit ZSQ (W4A4) and extends quantization to extreme bit-widths (W3A3). Furthermore, we conduct an extensive analysis to uncover the underlying factors contributing to the efficacy of GoodQ.
[LG-21] Contextual Slate GLM Bandits with Limited Adaptivity ICML2026
链接: https://arxiv.org/abs/2606.31449
作者: Tanmay Goyal,Sukruta Prakash Midigeshi,Gaurav Sinha
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Accepted at ICML 2026
Abstract:We investigate the contextual slate bandit problem with generalized linear rewards under limited adaptivity. At each round, the learner is presented with N sets of items, where each item is represented by a d -dimensional feature vector. The learner then constructs a slate by selecting one item per set; the resulting slate yields a scalar reward sampled from a Generalized Linear Model (GLM). We propose algorithms under two limited-adaptivity settings: (a) Batched and (b) Rarely-Switching. For the batched setting, we introduce B-SlateGLinCB, which partitions the time horizon into \mathcalO(\log\log T) batches such that each batch’s policy relies only on data from previous batches. For the rarely-switching setting, we propose RS-SlateGLinCB, which adaptively performs only \mathcalO(Nd\log T) parameter updates. Under a diversity assumption on the item sequences, we prove that B-SlateGLinCB and RS-SlateGLinCB achieve regret bounds of \mathcalO(Nd^3/2\sqrtT) and \mathcalO(Nd\sqrtT) , respectively. Notably, both bounds are independent of the non-linearity parameter \kappa that is typically found to scale the regret of GLM bandit algorithms. Our algorithms are computationally efficient, requiring only \textpoly(N) time per round despite 2^\Omega(N) possible slates. Simulations show our algorithms outperform existing baselines with limited adaptivity and remain competitive with Slate-GLM-OFU, a fully adaptive state-of-the-art algorithm. Notably, a slightly modified B-SlateGLinCB empirically matches this baseline. Finally, we demonstrate strong performance in a practical in-context example selection task for language models.
[LG-22] Dualformer: Efficient Feature Extractor for Complex-valued Blind Communication Signal Analysis
链接: https://arxiv.org/abs/2606.31352
作者: Yurui Zhao,Xiang Wang,Jingreng Lei,Wanlong Zhang,Yik-Chung Wu,Zhitao Huang
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 18 pages, 11 figures
Abstract:Designing effective feature extractors is critical for blind signal analysis tasks such as automatic modulation recognition (AMR), signal scheme recognition (SSR), and \colorblack signal structure parsing (SSP). In this work, we propose dual-channel neural network (DualNN) that efficiently exploits complex-valued signals through parameter sharing across IQ channels. Unlike traditional real-valued or complex-valued models, DualNN is a groundbreaking framework which shares the network parameters for processing the real and imaginary parts of the complex-valued signals, and is theoretically shown to reduce generalization error while preserving expressive capacity. Specifically, we propose a novel Transformer-based architecture to implement DualNN, called Dualformer. The Dualformer segments input signals into patch-level tokens and captures multi-granularity features, enabling robust performance across diverse signal analysis tasks. Furthermore, we conduct extensive experiments comparing Dualformer with three Transformer-based baselines and four conventional DL-based approaches. Results demonstrate consistent performance improvements on AMR, SSR, and SSP tasks. Besides, the modular design of DualNN allows it to generalize well to blind signal processing tasks such as blind source separation and low-SNR spectrum sensing. This work paves the way for a broader application of DualNN architectures in unsupervised and weakly supervised complex-valued signal analysis scenarios.
[LG-23] Domain-Decomposed Randomized Neural Networks for Partial Differential Equations in Unbounded Domains
链接: https://arxiv.org/abs/2606.31342
作者: Haixin Wang,Haoning Dang,Fei Wang,Shimin Guo
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:Partial differential equations on unbounded domains are challenging because the exterior region must be represented without excessive truncation error. Truncation-based methods often require problem-dependent artificial boundary conditions, while global spectral bases may be inefficient for localized structures, irregular geometries, or solutions with different near-field and far-field behaviors. We propose a domain-decomposed randomized neural network framework for such problems. Different randomized subnetworks are assigned to different spatial regimes: a near-field subnetwork captures local and geometric features, whereas a far-field subnetwork represents exterior decay. The subnetworks are coupled by boundary and interface conditions, and only the output-layer coefficients are solved from linear least-squares systems arising from Petrov–Galerkin or collocation formulations. We develop a Petrov–Galerkin method for semi-unbounded elliptic problems and a collocation method for fully unbounded, perforated, and time-dependent problems. A conditional bounded-parameter approximation result is proved in a broken Sobolev norm, together with an error decomposition covering approximation, empirical-consistency/quadrature, and least-squares optimization errors. Numerical experiments for Poisson and time-dependent Schrödinger equations demonstrate the accuracy and flexibility of the proposed method.
[LG-24] Expected Gain-based Escalation in Vertical Federated Learning
链接: https://arxiv.org/abs/2606.31331
作者: Mohamad Mestoukirdi,Vincent Corlay
类目: Machine Learning (cs.LG)
*备注:
Abstract:Collaborative inference can improve predictive performance by integrating complementary information across agents, but applying collaborative fusion to every sample can incur unnecessary communication and computational overhead. This trade-off is particularly relevant in vertical federated learning (VFL), where clients observe different views of the same sample and fusion typically requires transmitting intermediate representations to a server. We study selective escalation in a two-round VFL inference protocol, in which a low-cost first round produces a prediction from client posteriors and a second embedding-fusion round is invoked only when it is expected to improve the final decision. We formulate routing as expected-gain score estimation: a sample is escalated when a predicted improvement in correctness justifies the additional communication. The proposed analytical score combines a calibrated pooled posterior with classwise reliability estimates of the VFL model, both obtained from held-out calibration data, yielding an interpretable router that requires no separately trained routing network. Experiments on multi-view classification benchmarks, including controlled test–time view degradation settings, show that the proposed router improves the communication-accuracy trade-off over confidence-, learned-gain-, and deferral-based baselines.
[LG-25] Safe Online Learning via Smooth Safety-Structured Policy Composition
链接: https://arxiv.org/abs/2606.31320
作者: Hongpeng Cao,Liqun Zhao,Yuliang Gu,Naira Hovakimyan,Lui Sha,Marco Caccamo
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注:
Abstract:Safe online reinforcement learning requires policies to respect safety constraints while maintaining smooth optimization dynamics. Existing approaches typically rely on either strict safety enforcement via action interventions, which introduce discontinuities in system interaction and learning, or soft safety constraint formulations, which preserve smooth learning but provide limited safety assurance. We propose AutoSafe, a safety-aware policy architecture that integrates structured safety monitoring and intervention directly into the action generation process. This design enables smooth, risk-dependent transitions between performance-driven and safety-preserving behaviors, resulting in continuous online interaction and learning dynamics. Empirical results across a suite of continuous-control benchmarks demonstrate strong safety enforcement without sacrificing learning smoothness. We further validate AutoSafe on a physical cart-pole system, highlighting its practical effectiveness for safe online learning in the real world.
[LG-26] Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry
链接: https://arxiv.org/abs/2606.31291
作者: Alexander Fabisch,Melvin Laux,Mariela De Lucas Álvarez,Edoardo Caroselli,Julian Theis
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep reinforcement learning has the potential to solve attitude control problems more adaptively, precisely, and robustly by handling nonlinear dynamics, uncertainties, and failure cases more effectively than traditional attitude control approaches. We explore reinforcement learning (RL) for attitude control in spacecraft re-entry. An industry-standard proportional-integral-derivative controller with gain scheduling serves as a strong baseline for model-free RL and hybrid controllers that combine these two approaches. We formalize the application in the RL framework to apply continuous, off-policy RL. State-of-the-art RL achieves comparable performance to traditional control approaches in this domain. However, its out-of-distribution generalization is not sufficient. Hence, we use dynamics randomization to introduce challenging task variations during training and enforce generalization in a predefined operational envelope. Finally, we assess the best obtained RL-based controllers with application-specific metrics to show superior performance in comparison to traditional controllers in the operational envelope, that is, hybrid controllers are able to track the angle of attack better and are more robust under variations of mass, inertia tensor, and flap actuator bandwidth.
[LG-27] Patch-PODiff-ViT: Structured Latent Diffusion with Patchwise POD for Super-Resolution and Uncertainty Quantification
链接: https://arxiv.org/abs/2606.31290
作者: Onkar Jadhav,Tim French,Matthew Rayson,Nicole L. Jones
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion models enable probabilistic super-resolution and conditional generation, but pixel-space methods are computationally expensive and learned latent spaces often lack interpretable uncertainty quantification. We introduce Patch-PODiff-ViT, a structured latent diffusion framework in which the latent space is defined by patchwise Proper Orthogonal Decomposition (POD), a fixed linear orthonormal basis over local patches, rather than learned by a nonlinear autoencoder. This yields low-dimensional, variance-ordered tokens that preserve spatial structure and enable efficient diffusion in a structured low-dimensional latent space with a Vision Transformer. Because the decoder is fixed, linear, and orthonormal, latent coefficient uncertainty can be propagated directly to physical-space predictive variance, enabling analytic propagation of predictive variance through the linear decoder without Monte Carlo estimation in pixel space. Across sea surface temperature, medical imaging, and natural images, the method achieves strong reconstruction with fewer parameters and lower memory, while producing well-calibrated spatial uncertainty that closely matches empirical ensembles.
[LG-28] Probabilistic Inversion with Flow Matching
链接: https://arxiv.org/abs/2606.31288
作者: Baldur Paulwitz,Stefan Buske
类目: Machine Learning (cs.LG); Probability (math.PR); Geophysics (physics.geo-ph)
*备注:
Abstract:We demonstrate the application of Flow Matching, a technique originating from generative Artificial Intelligence, to probabilistic inversion in geophysical settings, such as seismic Full-Waveform inversion. We adapt the well-established mathematical theory of Flow Matching from generative Artificial Intelligence to the context of probabilistic inversion. We evaluate the approach with two case studies: a simple 2D velocity model to illustrate the general features of the method, and the OpenFWI dataset to show its capabilities for probabilistic inversion of more complex seismic velocity models.
[LG-29] Sequential sparse Gaussian process quantile regression
链接: https://arxiv.org/abs/2606.31284
作者: Hugo Nicolas(PLATON, CMAP),Olivier Le Maître(PLATON, CMAP)
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Quantile regression aims to estimate the conditional quantiles of a response variable from observed data. In a Bayesian setting, Gaussian process quantile regression provides uncertainty quantification but faces significant computational challenges due to the nonconjugacy of the asymmetric Laplace likelihood and the cost of posterior inference. We develop a sparse Gaussian process framework in which the quantile function is represented through a reduced set of inducing variables and posterior inference is performed using a Laplace approximation. A decomposition of the predictive uncertainty into conditional-prior and posterior-induced variance components is then exploited to drive two complementary adaptive mechanisms: inducing-input infilling and data acquisition. These mechanisms are combined within a sequential algorithm that allocates computational effort toward the dominant source of predictive uncertainty and adaptively controls model complexity. Numerical experiments on benchmark problems demonstrate the accuracy of the Laplace approximation, the benefits of variance-based inducing-input placement, and the effectiveness of the proposed sequential enrichment strategy compared with predefined data-acquisition strategies.
[LG-30] Revisiting the Volume Hypothesis ICML2026
链接: https://arxiv.org/abs/2606.31282
作者: Ari Pakman,Lior Kreimer,Yakir Berchenko
类目: Machine Learning (cs.LG)
*备注: Accepted to ICML 2026
Abstract:Modern deep neural networks often contain far more parameters than needed to fit their training data, yet they achieve impressive generalization. A common explanation for this success is the implicit bias of stochastic gradient descent (SGD). An alternative volume hypothesis posits that, within low training-loss regions, loss-landscape basins leading to strong generalization occupy much larger regions of weight space than basins that generalize poorly, and therefore SGD is simply more likely to land in the former. Recent experimental explorations of this idea present seemingly contradictory results. While in one set of experiments randomly sampling the network weights until achieving zero training error yielded poor generalization, molecular dynamics density estimates supported the volume hypothesis. We observe that these experiments were performed at different dataset size regimes, and explore an intermediate regime using the Replica Exchange Wang-Landau algorithm to estimate the joint density of states over training and test accuracies in binary networks. Across several architectures and datasets, we show that the generalization advantage of gradient learning over random sampling training generally diminishes as the training data size grows, suggesting a resolution of the paradox.
[LG-31] he Calibration Turn in AI-Assisted Research: A Conceptual and Methodological Framework for Evidence-Licensed Claims
链接: https://arxiv.org/abs/2606.31273
作者: Hongmin Li
类目: Machine Learning (cs.LG)
*备注: 42 pages, 4 figures. Companion code and synthetic simulation artifacts: this https URL
Abstract:AI-assisted research has entered a stage in which the central question is not only whether systems can generate hypotheses, run experiments, or produce manuscripts, but whether their scientific claims are calibrated to the evidence that supports them. This Perspective-style paper develops a conceptual and methodological framework for evidence-licensed claims in AI-assisted research. Motivated by representative routes including specialized scientific foundation models, LLM research assistants, multi-agent co-scientists, AI Scientist pipelines, mathematical discovery agents, and self-driving laboratories, it represents AI-assisted research as five operators: hypothesis generation, model-mediated consequence derivation, external validation, belief update, and claim calibration. The central claim is that calibration is not merely cautious wording but a mechanism for managing scientific assertion rights: evidence licenses some forms of speech and withholds others. The paper distinguishes linguistic, consequence-based, interventional, and evidence-licensed semantics; defines the claim-evidence gap and epistemic debt; and treats minimal structural reconstruction across heterogeneous outputs as an upward form of claim calibration. AISim-Cal is included as an illustrative synthetic dynamics exercise, not as an empirical forecast or benchmark. The resulting principles are: no claim without license, validation does not determine claim level, and automation amplifies the need for calibration. Reliable AI-assisted research is therefore evaluated as a loop that generates hypotheses, derives testable consequences, accepts independent adjudication, updates beliefs, and outputs only evidence-licensed claims.
[LG-32] Learning Gaussian Graphical Models from a Glauber Trajectory Without Mixing
链接: https://arxiv.org/abs/2606.31230
作者: Eric Shen,Tony Wu,Mahbod Majid,Ankur Moitra
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We study the task of learning the structure of a d -sparse Gaussian graphical model on n variables from a single trajectory of Glauber dynamics. Beyond algorithmic considerations, many applications present temporally correlated observations rather than i.i.d.\ samples. In the classical i.i.d.\ setting, under comparably general sparsity and minimum edge-strength assumptions, sublinear-in- n sample guarantees are known, but achieving them in polynomial-time remains open. Motivated in part by this gap, we give a polynomial-time algorithm that recovers the conditional-independence graph from a single Glauber trajectory, with a trajectory-length guarantee that does not depend on the mixing time. Technically, our algorithm has three components. First, we estimate the conditional variances and rescale the trajectory to reduce to the unit-diagonal case, without changing the underlying graph. Second, we design a local edge test that extracts adjacency information from short update windows by isolating pairwise influence. Third, we aggregate these local statistics using a robust median-based estimator, and prove accuracy despite temporal dependence arising from a single trajectory. Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML) Cite as: arXiv:2606.31230 [cs.LG] (or arXiv:2606.31230v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.31230 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-33] Probing Memorization of Tabular In-Context Learning ICML
链接: https://arxiv.org/abs/2606.31208
作者: Francesco Capano,Jonas Böhler
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at 2nd ICML Workshop on Foundation Models for Structured Data, 2026
Abstract:Large tabular models (LTMs), i.e., tabular foundation models leveraging in-context learning (ICL), achieve state-of-the-art performance on tabular tasks. While LLMs are known to unintentionally memorize training data, the memorization dynamics of LTMs remain largely unexplored. We investigate the potential for parametric memorization in tabular ICL. We introduce ICLMEM, a probing framework designed to separate context-based predictions from parametric memorization. Our zero-information multiple-choice context strips away valid contextual patterns to force the model to fall back on its parametric memory. Our controlled fine-tuning setup establishes membership ground truth and accounts for common pitfalls, e.g., distribution shift, feature contamination, base-rate fallacy, and the pre-trained base model acts as reference to calibrate for sample difficulty. Our controlled evaluation on a leading real-world-trained LTM detects moderate memorization signals in 8 out of 10 tasks ( \textAUC up to 0.67 and TPR at 1% FPR 0.1 ). Notably, memorization signals are strongest for low-cardinality and binary tasks. However, they largely vanish under realistic training conditions. Our findings show LTM memorization signals under specific circumstances (single-task fine-tuning with fixed samples across many epochs and small query size). To protect sensitive data, appropriate measures must be taken, which we discuss.
[LG-34] Machine Learning-based Feedback Linearization Control of Quadrotor Subject to Unmodeled Dynamics
链接: https://arxiv.org/abs/2606.31199
作者: Amos Alwala,Gabriel da Silva Lima,Wallace Moreira Bessa
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: This paper is part of the EURODINAME III proceedings ( this https URL )
Abstract:The control of agile quadrotors in dynamic and uncertain environments remains an open area of investigation to this day, particularly when the complete system dynamics are partially known or highly nonlinear. This work introduces a novel machine learning-based feedback-linearization control framework that employs a Gaussian Radial Basis Function (RBF) neural network (NN) to model and compensate for unmodeled dynamics in real time. The proposed controller leverages the universal approximation capability of RBF networks to model nonlinearities and uncertainties. An online adaptation of the RBF NN updates the network’s weights without prior training. The control law is derived using the Lyapunov stability theory, herein guaranteeing closed-loop stability and providing theoretical guarantee of asymptotic convergence of a trajectory tracking task. Gazebo simulation and real flight experiments are conducted using the Bitcraze’s Crazyflie 2.1 quadrotor subject to unmodeled air drag, actuator dynamics, and external disturbance. Despite incomplete knowledge of prior dynamics and presence of external disturbance such as air drag and drift in state estimation, the proposed controller improves trajectory tracking with rapid convergence and reduction of position-norm and yaw orientation RMSE by more than 7.13% and 49.27% respectively compared to baseline feedback linearization controller.
[LG-35] ISM:Self-Improving Strategy Memory for Continual Mathematical Reasoning ICML2026
链接: https://arxiv.org/abs/2606.31191
作者: Prakhar Dixit,Tim Oates
类目: Machine Learning (cs.LG)
*备注: 3rd AI for Math Workshop at ICML 2026 Forty-Third International Conference on Machine Learning
Abstract:We propose Intelligent Schema Memory (ISM), a self-evolving memory-augmented system that improves mathematical reasoning for a frozen LLM under continual learning with hard episodic resets. ISM maintains a compact, self-refined bank of strategy schemas learned from both successful and failed episodes, with symbolic tools that check intermediate steps and certify this http URL updating model parameters, ISM outperforms passive, retrieval, and reflection baselines on MATH-Hard and OlympiadBench, using 64% and 86% fewer schemas respectively than the strongest passive baseline. These results show that small, actively maintained, and verified strategy memories can support reliable continual mathematical reasoning under strict episodic isolation.
[LG-36] Probe Choice Changes Canary-Memorization Verdicts: Three Post-Hoc Disagreement Case Studies in a Text-Dominant LoRA-Tuned Autoregressive Testbed ICML2026
链接: https://arxiv.org/abs/2606.31168
作者: Zhichao Fan,Zexin Zhuang,Yanhang Li
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: ICML 2026 FoGen Workshop camera-ready. 17 pages, 4 figures, 12 tables
Abstract:We audit a fixed prefix-window mean-NLL memorization probe (K=20) on a Qwen2.5-VL-7B canary testbed and report three post-hoc cases where it disagrees with full-span secret NLL or greedy exact-recall. C3 (false negative, window truncation): damage lands on hex tokens outside K=20; the probe stays flat while hit@1 drops. C4 (false positive, non-secret drift): the probe moves, but approximately 99% sits on non-secret preamble; the secret span and hit@1 are unchanged. C5 (ambiguous in-window drop): the probe falls on an undertrained baseline while full-span hex is positive and hit@1=0. Recommendation: report (i) full-span secret NLL, (ii) a span-localised decomposition, (iii) behavioural exact-recall at k=4, and (iv) decoy probes before asserting secret-specificity. Evidence is on controlled canaries in one backbone; magnitudes are testbed-specific.
[LG-37] An Empirical Study of Security Calibration in Large Language Models for Code
链接: https://arxiv.org/abs/2606.31159
作者: Mohammed Latif Siddiq,Md. Nafiu Rahman,Joanna C. S. Santos
类目: oftware Engineering (cs.SE); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: Accepted at the 42nd International Conference on Software Maintenance and Evolution (ICSME 2026) Research Track
Abstract:Large Language Models (LLMs) are rapidly transforming software development, yet their use in security-critical contexts raises a key question: do models know when their generated code is insecure? This property, known as calibration, measures whether a model’s confidence aligns with the true correctness of its outputs. We present the first large-scale empirical study of security calibration in LLM-generated code. We evaluate GPT-4o-mini, Gemini-2.0-Flash, and Qwen3-Coder-Next across multiple temperature settings on two complementary benchmarks: self-contained security tasks and multi-language repository-level contexts. Our results suggest that overconfidence is prevalent across the evaluated LLMs. Functional calibration is consistently worse than security calibration, suggesting that models estimate security outcomes more reliably than functional correctness, potentially because functional correctness depends on complex execution behavior. We also examine whether calibration-guided automated repair can help remediate vulnerabilities in LLM-generated code, finding only limited improvements while frequently introducing functional regressions. Moreover, we study different mitigation strategies for reducing False Trust, where models assign high confidence to vulnerable code. The results show that although architectural gating improves calibration on controlled benchmarks, calibration deteriorates in realistic repository-level settings, increasing the risk of high-confidence vulnerable outputs.
[LG-38] A Bayesian Filtering Approach for Learning Lagrangian Dynamics from Noisy Measurements
链接: https://arxiv.org/abs/2606.31137
作者: Kundan Kumar,Shreya Das,Simo Särkkä
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 5 pages, 1 figure, 2 tables
Abstract:This paper proposes a Bayesian filtering-based approach for learning the dynamics of a physical system from partial, noisy measurements. We model the system dynamics using a Lagrangian mechanics formulation. As in Lagrangian neural networks (LNNs), we parameterize the kinetic and potential energies with neural networks. The unknown external forces in the Lagrangian formulation are modeled as white Gaussian noise. The corresponding Euler–Lagrange equations then yield a continuous-time stochastic state-space model (SSM) that describes the system dynamics. The neural network parameters and system states are then jointly learned via a maximum-likelihood method using Gaussian-approximation-based Bayesian filters. The effectiveness of the proposed method is demonstrated on pendulum and Duffing oscillator examples, and its performance is compared with conventional LNNs and with approximate Bayesian filters using known system models.
[LG-39] Can Tabular In-Context Learners Generalize to Biomolecular Property Prediction?
链接: https://arxiv.org/abs/2606.31126
作者: Davy Guan,Lu Zhang,Asiri Wijesinghe,Allen Zhu,He Zhao,Helen Power,F. Hafna Ahmed,Andrew Warden,Cheng Soon Ong,Daniel M. Steinberg
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Machine Learning (stat.ML)
*备注:
Abstract:Predicting biomolecular properties from limited labeled data is a central bottleneck in protein engineering and small-molecule design. As strong pretrained encoders now supply rich fixed-length representations, the difficulty has shifted from representation learning to building a data-efficient predictor for the few-shot regime. Tabular foundation models such as TabPFN3 and TabICL are unlikely candidates for this role: they are in-context learners pretrained on synthetic tables drawn from random causal graphs, a generative prior with no obvious correspondence to the processes that produce protein sequences or molecular graphs. That this tabular, causal inductive bias should transfer to biomolecular data at all is unintuitive, yet we find it does. Treating each method as a predictor-representation pair, we evaluate across two domains. Over a fixed ESMC representation, tabular in-context learning is consistently competitive for protein fitness regression on ProteinGym and a diverse esterase dataset. For small-molecule classification with ECFP/RDKit descriptors, no single pairing dominates across TDC ADMET, MoleculeNet, FS-Mol, and DrugOOD; representation choice becomes a primary determinant, as expected when the predictor’s own prior is indifferent to molecular structure. We conclude that tabular foundation models are strong performers on biomolecular prediction tasks, but that their performance depends strongly on the sequence or molecular representation used.
[LG-40] Visualizing High-Dimensional Graph Embeddings via Informed Multi-View Projections
链接: https://arxiv.org/abs/2606.31119
作者: Ya Ji(1),Xuefeng Li(1),Timo Brand(2),Jacob Miller(2),Peng Zhang(1),Stephen Kobourov(2),Yifan Hu(1) ((1) Khoury College of Computer Sciences, Northeastern University, Seattle, (2) School of Computation, Information and Technology, Technical University of Munich, Heilbronn, Germany)
类目: Machine Learning (cs.LG)
*备注: 18 pages, 13 figures
Abstract:Graphs are commonly visualized in 2D, where humans readily interpret spatial relationships, yet such layouts often distort higher-dimensional structure. We propose to embed graphs in high-dimensional space and search for informative 2D viewpoints that optimize aesthetic and readability metrics (e.g., edge crossings and angular resolution), enabled by a novel differentiable surrogate for edge crossings. Numerical experiments show that these viewpoints consistently outperform standard 2D layouts, and can even surpass methods explicitly designed to optimize these metrics. We further introduce DataFly, an interactive system for exploring multiple candidate viewpoints through seamless navigation. A usability study demonstrates that our approach reveals structural patterns that remain hidden in conventional 2D visualizations.
[LG-41] Explaining Machine Learning and Memorization with Statistical Mechanics
链接: https://arxiv.org/abs/2606.31110
作者: Robin Theriault
类目: Machine Learning (cs.LG); Statistical Mechanics (cond-mat.stat-mech)
*备注: PhD thesis defended on January 15, 2026. Supervisor: Daniele Tantari. Committee: Elena Agliari, Aurelien Decelle, Daniele Tantari, Fosca Gianotti, Fabrizio Lillo
Abstract:Artificial neural networks (NNs) and machine learning (ML) algorithms are poorly understood from a theoretical perspective, which makes it difficult to fully realize their potential and overcome their weaknesses. For instance, ML algorithms train NN weights by moving them along a low-dimensional subspace of their allowed values, but this implicitly low-dimensional learning structure is not properly exploited to improve training because its nature is not well understood. Moreover, trained NNs are easily confused by pervasive adversarial attacks whose theoretical underpinnings are still unclear. This thesis aims to improve our theoretical understanding of NNs and ML, with a particular focus on adversarial attacks and implicitly low-dimensional learning. For this purpose, we use mathematical tools from statistical mechanics to study different types of NNs and ways in which they can fit the data. In particular, we study two classes of models that fit the data with various degrees of learning and memorization: dense associative memory (DAM) and restricted Boltzmann machines (RBM). In the process, we investigate connections between different versions of these models that are useful to make analytical investigations more efficient.
[LG-42] Fora: From Weight-Space to Function-Space Protection in Capability-Preserving Fine-Tuning
链接: https://arxiv.org/abs/2606.31092
作者: Rui Zhou,Tianci Xie
类目: Machine Learning (cs.LG)
*备注:
Abstract:Full fine-tuning adapts large language models to new tasks but can erode capabilities they already possess. Existing remedies protect through proxies such as parameter distances, importance penalties, output matching, or dominant singular directions of the weights, but none directly asks which activation directions the preserved capability relies on. We argue that a capability is characterized more faithfully by the activation subspace it induces than by the singular geometry of the weight matrix, and develop function-space protection, instantiated as FORA (Function-space Orthogonal Residual Adaptation). From label-free calibration inputs, FORA estimates, per layer, the principal directions Q of the input-activation covariance and forms a right projector P_Q = I - QQ^T . Paired with a left projector P_U from the weight SVD, the update is \Delta W = P_U M P_Q + U_2 D_\delta V_2^T : a high-capacity branch structurally barred from reading capability-relevant function directions, plus a narrow spectral channel for controlled plasticity. The construction extends to parameter-efficient adaptation via M \to (\alpha/r) BA . Across three settings on Qwen3-1.7B, including COGS and GSM8K learned while preserving translation and translation learned while preserving math, FORA consistently improves preservation over weight-space projection and standard regularization, with only a small new-task trade-off in the math-preservation setting. A controlled ablation isolating the projection source shows that the advantage comes not from projection itself, but from projecting onto capability-derived rather than weight-derived directions. Code is available at this https URL.
[LG-43] Warp RL: Reshaping Base Policy Distributions for Dynamics Adaptation
链接: https://arxiv.org/abs/2606.31043
作者: Ethan Hirschowitz,Fabio Ramos
类目: Machine Learning (cs.LG); Robotics (cs.RO)
*备注: 17 pages, 7 figures
Abstract:Residual reinforcement learning adapts a pretrained robot policy by learning an additive correction to its actions. While effective when adaptation amounts to shifting the base policy’s action distribution, additive corrections cannot change the distribution’s shape, scale, or state-dependent geometry – limitations we formalize as wrong variance, miscalibrated confidence, and non-uniform correction. We show that these matter under dynamics shift: when the base distribution is geometrically mismatched to the shifted system, residual correction can underperform even the unadapted policy. We propose \textbfWarp RL, a policy adaptation method that replaces additive residuals with an invertible, state-conditioned transformation of the base policy’s action distribution. Instantiated with monotonic rational-quadratic spline flows [arXiv:0706.1234v1], Warp RL preserves identity initialization, strictly generalizes additive residual correction, and exposes a structured adaptation space suitable for both policy-gradient and gradient-free optimization. Across a variety of ManiSkill3 manipulation tasks with controlled dynamics shifts, Warp RL matches residual correction when translation is sufficient and substantially outperforms it when adaptation requires distributional reshaping. We further demonstrate that warping can replace additive correction in an off-policy sim-to-real pipeline, achieving comparable success rate with 30% faster task completion on a real-robot peg-insertion task.
[LG-44] aching LLM s to Recommend and Defer in Underrepresented Epilepsy Care
链接: https://arxiv.org/abs/2606.31036
作者: Shreyas Rajesh,Kartik Sharma,Tonmoy Monsoor,Mehmet Yigit Turali,Richard Idro,Juliana Kayaga,Robert Sebunya,Tracy Tushabe Namata,Jessica Nichole Pasqua,Vwani Roychowdhury,Rajarshi Mazumder
类目: Machine Learning (cs.LG)
*备注: 34 pages, 8 figures
Abstract:Specialist epilepsy expertise is scarce in resource-constrained settings, making LLM-based decision support attractive for frontline clinicians managing longitudinal treatment. Such systems must adapt to local prescribing practice and know when to defer. We study this problem in Ugandan pediatric epilepsy care, predicting anti-seizure medication regimens from longitudinal unstructured clinic notes. Standard prompting achieves non-trivial agreement with physician prescriptions, but neurologist review shows that many errors reflect distribution-miscalibrated prescribing defaults rather than failures to parse the local record. We introduce MANANA, a non-parametric prompt-learning framework that learns local prescribing guidance from a small patient-level training set. MANANA converts observed prescription errors into auditable prompt memories, instantiated in single-agent and multi-agent variants, and improves over classical ML models, direct LLM prompting, and prompt-optimization baselines across two independently collected Ugandan cohorts. We further propose Bayesian prompt averaging, which converts the learned prompt trajectory into prescription likelihoods and an uncertainty-based deferral signal. On the independently collected held-out cohort, this improves visit-level top-3 prescription accuracy by 4-8 percentage points over prompt-optimization baselines and enables selective prediction: the system can auto-handle the most confident half of cases at 95% precision, or the most confident quarter at 99% precision, while deferring lower-confidence cases for specialist review.
[LG-45] Offline Reinforcement Learning for Fluid Controls: Data-based Multi-observational Policy Extraction
链接: https://arxiv.org/abs/2606.31025
作者: Deepak Akhare,Luning Sun,Xin-Yang Liu,Xiantao Fan,Timo Bremer,Ben Zhu,Jian-Xun Wang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Active flow control is a fundamental application in engineering. Recent advances in deep reinforcement learning have made progress in this field. However, the classical online RL approaches require extensive real-time interactions with the high fidelity environment, while each sensor configuration change necessitates whole policy retraining. All these factors result in prohibitive computational costs for real-world applications. In this work, we propose a novel offline RL framework that addresses both challenges through data-driven policy extraction. We develop a sensor position-conditioned architecture that enables a single policy network to adapt seamlessly to multiple sensor arrangements. The position-conditioned approach incorporated spatial relationship modeling through Point Attention layers to ensure the generalizability to varying sensor placements. We demonstrate the framework on two representative problems, mitigating chaoticity in the Kuramoto-Sivashinsky equation and flow control over airfoils governed by the Navier-Stokes equation. The result demonstrates that the policy extraction from the dataset provides unprecedented flexibility for sensor placement optimization. This approach represents a significant step towards adaptive, intelligent flow control systems.
[LG-46] Certified Speculative Execution for Untrusted AI Agents
链接: https://arxiv.org/abs/2606.31023
作者: Chenyu Zhou,Qiliang Jiang,Shuning Wu,Xu Zhou
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注: 15 pages, 2 figures, 24 tables. Includes a technical appendix (full proofs and all supplementary tables)
Abstract:Hard-constrained sequential decision systems have no certified way to spend the test-time compute of modern AI: executing the multi-step drafts of a learned policy or a frozen LLM forfeits the feasibility guarantee a trusted solver provides, while invoking the solver at every step forfeits the speed the AI offers. Certificate-Gated Prefix Acceptance (CGPA) closes this gap with a certified speculative-execution contract for untrusted AI agents: a trusted verifier rejects constraint-violating transitions exactly, a conformally calibrated value boundary gates the longest low-cost prefix within a per-segment regret budget, and the rest defers to the solver, so safety, regret, and speed decouple by construction. The contract drives every untrusted proposal source - adversarial drafters and six heterogeneous frozen LLMs (including a 12B model that violates constraints in 98% of direct rollouts) - to zero applied violations; a certificate-aware learned boundary, conformally calibrated, drives mean regret three orders of magnitude below unguarded acceptance, to within sampling noise of the stepwise oracle (95% CI spanning zero), and under calendar shift a learned proposal source overtakes it on 15 of 18 held-out days. On a deployment-scale unit-commitment instance it turns a frozen 8B LLM into a 2.96x per-episode wall-clock speedup at 2.1% regret, outpacing the domain heuristic (1.79x) and a safe receding-horizon baseline (1.07x): the more capable the untrusted source, the faster the certified system, at guarantees that never change.
[LG-47] Estimating Supply Incrementality in Two-sided Marketplaces: A Causal Machine Learning Approach KDD2025
链接: https://arxiv.org/abs/2606.30999
作者: Yufei Wu,Daniel Schmierer,Dan Zylberglejd
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Applications (stat.AP); Methodology (stat.ME)
*备注: 5 pages, 3 figures. Accepted at the KDD 2025 Workshop on Causal Inference and Machine Learning in Practice (not presented)
Abstract:In two-sided marketplaces with heterogeneous products, it is important to understand the causal relationship between additional supply and marketplace outcomes, such as the total quantity transacted or transaction value in the marketplace. This paper studies a causal machine learning approach to estimating this relationship across product segments. We use the Airbnb marketplace as an example, focusing on the impact of additional listing supply on total bookings, but the methodology applies to other two-sided marketplaces. Our approach combines double/debiased machine learning with a hierarchical Bayesian framework that leverages pre-existing knowledge as priors. We construct tractable and informative features for the model by leveraging measures of product segment similarity from the geospatial literature. We find that such a model provides plausible estimates of the marketplace returns to additional supply and strong out of sample performance.
[LG-48] Multistage Defer Trees for Hybrid Interpretability: If at First You Cant Succeed Tree Again
链接: https://arxiv.org/abs/2606.30995
作者: Zakk Heile,Hayden McTavish,Margo Seltzer,Cynthia Rudin
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recent work has shown that well-optimized individual decision trees can match complex black box models in some settings, primarily in noisy domains. For the remaining settings, however, complex ensembled compositions of trees often achieve higher accuracy at the cost of interpretability, leaving practitioners with difficult modeling decisions along an accuracy-interpretability tradeoff. Ideally, we would like to classify as much of the data as possible with one or a small number of trees, achieving interpretability for most samples while maintaining state-of-the-art accuracy. We introduce Multistage Defer Trees: a sequence of sparse decision trees that each make predictions for most samples, while deferring a small proportion to the next tree in the sequence or, ultimately, to a black box. We demonstrate that we can train this model class to match the performance of complex tree-based ensembles while routing most samples through only one or a small number of sparse decision trees. We discuss a range of techniques for training these models while maintaining simplicity. Our method expands the accuracy–interpretability frontier in settings where single-tree methods remain insufficient, demonstrating that even when complex models are necessary, they need not be fully opaque.
[LG-49] ShardNet: Training Neural Controllers with Hard Non-Convex Constraints
链接: https://arxiv.org/abs/2606.30935
作者: Long Kiu Chung,Shreyas Kousik
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 11 pages, 4 figures
Abstract:While neural network control policies are powerful, their deployment on safety critical systems depends on ensuring that they obey strict constraints. Existing work often treats safety as a metric to optimize for, which competes with other performance objectives, if training converges at all. Instead, we introduce ShardNet, a neural network architecture that strictly enforces unions of polyhedral constraints by construction, using a differentiable projection layer parameterized by a classification network. The key insight is to embed safety into the neural network’s structure, allowing performance to be optimized independently because formal safety guarantees are always given. In contrast with existing neural architectures that can only enforce simple convex constraints, ShardNet enables the first safe-by-construction synthesis of forward-invariant neural network controllers on closed-loop systems where safety constraints are expressed as nonconvex unions of polyhedras or learned value function level sets. To support this, we also introduce a technique to verify and train such value functions correctly as rectified linear unit (ReLU) networks, which has not previously been possible. On double integrator benchmarks drawn from the literature, ShardNet policies maintain 100% safety on verified sets and achieves significantly lower objective loss compared to existing formal methods. Furthermore, our value function training technique also produces safe sets more than 3 times larger than existing verification approaches.
[LG-50] Quality-Aware Modulation for Diffusion Transformers
链接: https://arxiv.org/abs/2606.30934
作者: Luke Budny,Yuhong Guo,Kevin Cheung
类目: Machine Learning (cs.LG)
*备注:
Abstract:Modern text-to-image diffusion models, such as diffusion transformers (DiT), rely on timestep or prompt embeddings to modulate the strength of the denoising process in each timestep. While this modulation communicates the current noise level, it does not provide any quality-aware information, which can lead to generated images that are unaligned, visually inconsistent, and lacking in fidelity. In this paper, we propose the Quality Representation Module (QRM), a lightweight transformer module that learns a quality-aware representation based on existing model inputs, and produces a set of vectors M_qrm . These vectors adjust the adaptive LayerNorm modulation within the DiT transformer blocks, thereby injecting a quality-sensitive signal into the denoising parameters. The QRM introduces no significant changes to the sampling schedule or diffusion backbone. Experiments include ablations on QRM training losses and architectures, as well as empirical results demonstrating consistent image quality improvements over baseline DiT-based models.
[LG-51] Personalizing Marketplace Policies with Competing Objectives and Constrained Experiments: Evidence from a Job Marketplace KDD2026
链接: https://arxiv.org/abs/2606.30932
作者: Yufei Wu,Zhen Yan
类目: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注: 10 pages, 6 figures. Accepted at ACM SIGKDD 2026 (Applied Data Science Track)
Abstract:Two-sided marketplaces connect distinct user groups whose interests often conflict – improving outcomes on one side could degrade the other side’s experience. To address this challenge, we deploy an integrated framework for personalizing free-value thresholds – a policy governing the scope of complimentary services for job listings – across a two-sided job marketplace connecting millions of employers and job seekers. Our personalized policy delivers statistically significant and economically sizable lift in the target metric while respecting engagement guardrail constraints. Direct application of standard uplift methods proves insufficient here for two reasons. First, cross-side externalities demand multi-objective optimization: maximizing employer-side metrics risks harming job seeker engagement, with effects varying substantially across job segments. Second, marketplace interference necessitates cluster-level randomization, limiting us to few discrete treatment levels – effectively a form of positivity violation that rules out methods designed for continuous treatments. We contribute an integrated framework with three components. Our ensemble-based hybrid ranking models target and guardrail metrics separately, cutting guardrail risk by over 10% for equivalent target gains compared to single-objective approaches. A treatment effect extrapolation method extends our estimates from limited experimental variation to untested policy levels, relying on monotonicity assumptions that we validate empirically. Finally, we present production deployment, where post-launch data confirms both extrapolation accuracy and guardrail compliance. Our deployed system demonstrates that principled methodology can enable meaningful personalization even when experiments are severely constrained and different objectives compete – common conditions that characterize many real-world marketplaces. Comments: 10 pages, 6 figures. Accepted at ACM SIGKDD 2026 (Applied Data Science Track) Subjects: Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME) Cite as: arXiv:2606.30932 [cs.LG] (or arXiv:2606.30932v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2606.30932 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Related DOI: https://doi.org/10.1145/3770855.3818460 Focus to learn more DOI(s) linking to related resources Submission history From: Yufei Wu [view email] [v1] Mon, 29 Jun 2026 21:34:37 UTC (11,071 KB)
[LG-52] A Systematic Approach to Multi-Agent AI from Advanced Regulatory Control Theory: Safe and Auditable LLM Operator Agents for Process Control
链接: https://arxiv.org/abs/2606.30877
作者: Idelfonso B. R. Nogueira,Sigurd Skogestad
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注:
Abstract:Recent literature shows that large language models (LLMs) are useful for general-purpose tasks yet perform poorly on specific domain ones. One reason is the difficulty of supplying narrow context to a general-purpose model and of bounding the task it is asked to perform. It is possible to hypothesise that a multi-agent reformulation under process-control principles offers a route to address those points, since control theory provides a discipline of decomposing a system into elements of contained scope, each defending one controlled variable, with conflicts resolved by structural priority: MIN/MAX selector networks for CV-CV switching and split-range (split-parallel) logic for MV-MV switching. The present work proposes such a reformulation, derived from Advanced Regulatory Control (ARC) theory. Each feedback loop in the ARC chain is mapped to one specialised LLM operator agent carrying the loop’s control-theoretic context (controlled variable, setpoint, chain priority, selector kind). The chain’s interaction logic (MIN/MAX selectors, override paths) is encapsulated as a single orchestrator agent. Two orchestrator variants are tested: a deterministic rule chain, and a Claude-based LLM orchestrator at a slower tier. The control principles limit each agent’s task and inform how its limitations are handled. The multi-agent system inherits the safety property of the ARC chain: every constraint conflict is resolved deterministically by the orchestrator, regardless of the LLM output. Evaluated on a dairy-barn ventilation case over a 4-day mixed-season scenario, Qwen 2.5 7B Instruct operator agents running offline on a 24 GB consumer GPU at a 5-minute cadence produce auditable trajectories, each paired with an operator-voice rationale that supports a control campaign logbook.
[LG-53] A Transferable Learned Temporal Prior for Transmission Reconstruction and Decision-Relevant Uncertainty in Real Outbreak Labels
链接: https://arxiv.org/abs/2606.30842
作者: Md Ahsan Karim
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 30 pages, 7 figures, 15 tables, 2 algorithms, 26 references
Abstract:Outbreak transmission reconstruction treats epidemiological timing and transmission labels as deterministic ground truth; neither has been systematically evaluated. We trained a logistic regression temporal prior on eleven disease families, locked all parameters before accessing any target outbreak data, and applied it without refitting to a strict Andes virus (ANDV) parent-ranking benchmark of 29 tasks. The locked prior achieved mean reciprocal rank (MRR) 0.571 versus 0.274 and Top-1 accuracy 37.9% versus 13.8% against the best source-trained parametric baseline (permutation p = 0.0002; 7-8 reversals to lose MRR significance). A phylogenetic concordance audit of 75 NYC mpox inter-host pairs - independent label-reliability evidence rather than a prior validation - found that 54.67% (exact 95% CI: 42.75-66.21%) were genomically unresolved or unsupported. Retaining uncertain edges in ANDV and Guangdong Delta graphs shifted top-5 source-priority sets (Jaccard 0.429-0.667). Transmission-label uncertainty was measurable in the outbreak evidence modules examined, and retaining uncertain links changed which source cases were prioritized for intervention.
[LG-54] Partition-Guided Distance Saliency: Bridging Decision and Objective Spaces in Many-Objective Optimization
链接: https://arxiv.org/abs/2606.30836
作者: Cláudio Lúcio do Val Lopes,Flávio Vinícius Cruzeiro Martins,Elizabeth Fialho Wanner
类目: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*备注: The 4th World Conference on eXplainable Artificial Intelligence 01-03 July, 2026 Fortaleza, Brazil Building transparent AI
Abstract:Explainability in Many-Objective Optimization (MaO) is currently hindered by the escalating complexity of the Pareto front, which renders the relationship between high-dimensional decision variables and objective outcomes increasingly opaque. As the number of objectives exceeds the limits of traditional visualization, decision-makers encounter a cognitive drought'' in identifying relevant trade-offs or specifying target regions without a priori knowledge. To bridge this interpretability gap, we introduce the Partition-Guided Distance Saliency (PGDS) framework, a novel XAI approach designed for continuous optimization landscapes. Our framework automates the explanation process through a three-stage pipeline that prioritizes geometric intuition over abstract rules. First, we employ a surrogate model that learns how geometric distances in the decision space map to proximity in the objective space. Second, to address the difficulty of manual target selection in high dimensions, the framework automatically partitions the objective landscape into distinct regions and identifies local Dominating Points’’ to serve as automated targets for improvement. Third, we quantify how sensitive a solution’s position is to each decision variable by measuring the distance shifts induced by perturbations to each variable. This allows PGDS to categorize features as either Drivers'' which facilitate convergence toward preferred regions, or Blockers’’ which represent geometric constraints hindering further progress. Validation on 10-objective benchmarks and a physics-informed engineering problem (Welded Beam) demonstrates that PGDS provides differentiated, actionable insights that traditional visualization and rule-based XAI methods fail to provide.
[LG-55] Mind the Residual Gap: Probabilistic Downscaling under Real-World Bias
链接: https://arxiv.org/abs/2606.30821
作者: Yujin Kim,Nidhi Soma,Sarah Dean
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Probabilistic downscaling is the task of modeling the conditional distribution of high-resolution fields given coarse inputs, and is a central challenge to atmospheric science, climate modeling, and other multiscale physical systems. A widely used paradigm decomposes the problem into a deterministic mean predictor followed by a stochastic residual generator. While effective in idealized settings, this mean–residual approach frequently produces biased and under-dispersive ensembles in real-world applications. Is this merely generic predictive uncertainty miscalibration? We show that the root cause is more fundamental: residual target misspecification, the residual distribution induced during training differs systematically from the one required at test time due to downscaling bias. To close this gap, we introduce ReMatch (Residual Distribution Matching). ReMatch aligns the training residual distribution toward the test-time regime via optimal transport in a low-dimensional PCA space. This preserves the statistical benefits of the mean–residual framework while reducing the train–test mismatch in the residual targets seen by the stochastic generator. On a controlled synthetic benchmark with varying bias levels and a real-world HRRR–ERA5 wind field downscaling task, ReMatch substantially reduces under-dispersion, improves calibration (SSR and CRPS), and outperforms strong baselines, including the standard mean–residual model and its variants, as well as state-of-the-art super-resolution models. Our code is available at this https URL.
[LG-56] Predictable GRPO: A Closed-Form Model of Training Dynamics
链接: https://arxiv.org/abs/2606.30789
作者: Rajat Ghosh,Datta Nimmaturi,Aryan Singhal,Vaishnavi Bhargava,Henry Wong,Johnu George,Debojyoti Dutta
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Group Relative Policy Optimization (GRPO) has become a standard tool for improving the reasoning ability of large language models, yet its training dynamics are still described empirically: reward trajectories are fit with low-parameter functional forms whose constants carry no mechanistic meaning, and hyperparameter choices remain a matter of trial and error. We develop a first-principles reduced-order model of these dynamics. The reduction has three consequences. First, it subsumes the empirical single-exponential saturation law as its overdamped limit, recasting the fitted plateau, timescale, and size exponent as the fixed point, inverse stiffness, and curvature-scaling exponent of the underlying potential, and adding, through the retained inertial term, the slow-start phase the single exponential cannot represent. Second, it yields predictions tied to independently measurable quantities rather than fitted ones: group-size invariance of the deterministic trajectory with a 1/G stationary fluctuation, a sharp stability threshold in the refresh interval, and an overdamped-to-oscillatory transition. Third, it furnishes diagnostics that separate failure modes a reward curve alone conflates – reward hacking, advantage degeneracy, policy concentration, and dynamical instability. Across three models and two group sizes, the closed-form trajectory fits training reward to R^2 \geq 0.91 and the predicted group-size invariance holds on both the reward curve and out-of-distribution transfer to eight math benchmarks. The stability and oscillatory predictions are exercised in a controlled exact-reduction setting where the mean-field assumption holds exactly: a softmax-bandit reduction reproduces the predicted overdamped-to-oscillatory transition and locates the refresh-interval stability threshold at the independently measured stiffness, with a deep-network demonstration left to future work.
[LG-57] ReactionAtlas: Ab origine exploration of chemical reaction networks with machine learning
链接: https://arxiv.org/abs/2606.30778
作者: Stefan Gugler,Max Eissler,Khaled Kahouli,Klaus-Robert Müller
类目: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*备注:
Abstract:Mapping a chemical reaction network, the graph of minima and transition states (TS) and the elementary reactions connecting them, is the natural language of chemistry, from catalysis to combustion to the origin of life. Constructing such a reaction network for a given chemistry has been impractical: it requires finding and characterizing tens of thousands of TS, a task for which traditional methods such as density functional theory (DFT) are typically prohibitively slow and require reactant and product as input. We introduce ReactionAtlas, which builds a reaction network \textitab origine from a handful of seed molecules and without hand-crafted rules. Specifically, our machine-learned generative model proposes reactions from kinetically sampled candidate compounds and a DFT-trained machine learned force field (MLFF) filters them to valid TS, the resulting products of which enter the search as new seeds. Starting from eight pre-biotic seeds (CH _2 O, H _2 O, OH ^- , H _3 O ^+ , CO _2 , H _2 CO _3 , HCO _3^- , H), ReactionAtlas discovers \sim 47,000 reactions among \sim 12,000 compounds. The MLFF TSs match the PBE0 references within 0.5 Å RMSD in 85% of the cases and can be easily brought to the PBE0 level. Thus, ReactionAtlas maps small carbohydrate chemistry up to C _4 H _8 O _4 at unprecedented scale and accuracy, including charge and stereo information. It enables novel insights into many well-studied reaction paths, including the formose cycle, which we highlight for its centrality to the chemical origins of life. Notably, our framework also allows establishing alternative reaction pathways for formose chemistry.
[LG-58] Joint discovery of governing partial differential equations from multi-source datasets by competitive optimization
链接: https://arxiv.org/abs/2606.30699
作者: Hao Xu,Siyu Lou,Yuntian Chen,Dongxiao Zhang
类目: Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:Discovering governing equations directly from observational data is a key step towards interpretable scientific machine learning. Current data-driven approaches typically operate on a single dataset, inherently limiting their performance when faced with restricted observations. In practice, multiple datasets are often available for the same physical system, distinguished only by distinct initial conditions or boundary configurations. Here, we present a competitive optimization framework designed to discover shared partial differential equations (PDEs) from multi-source datasets, termed MCO-PDE. The framework first trains independent neural surrogates for each data source, and then employs a soft-competitive weighting mechanism to dynamically assess dataset credibility and aggregate a consensus global coefficient. Integrated with a genetic algorithm for structural search, this approach simultaneously identifies the functional forms and parameters of the governing laws. We demonstrate that fusing as few as 50 observations per dataset across seven cases recovers canonical equations with high accuracy. The framework inherently handles two- and three-dimensional domains characterized by irregular boundaries and heterogeneous coefficients, and successfully extracts physically meaningful laws from real-world wave-tank experiments. Overall, this work establishes a promising route for automated scientific discovery via heterogeneous data fusion.
[LG-59] Criticality-Constrained Iterative Pruning for Energy-Efficient Spiking Neural Networks via Combined Importance Scoring
链接: https://arxiv.org/abs/2606.30676
作者: Muhammad Hamza
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: 30 pages, 6 figures, 10 tables
Abstract:Deploying spiking neural networks (SNNs) on neuromorphic hardware demands aggressive synaptic pruning while preserving temporal computation integrity. Existing strategies either neglect neuronal criticality or rely on convex relaxations of the inherently combinatorial pruning problem whose fractional masks, upon binarisation, destroy accuracy at moderate-to-high sparsity. We present Criticality-Constrained Quadratic Pruning (CQP), a native PyTorch pipeline that fuses weight magnitude with surrogate-gradient criticality into an analytically exact importance metric, eliminating the rounding artefacts endemic to solver-based approaches. We formally characterise a continuous-relaxation trap wherein OSQP-solver fractional masks overshoot the intended sparsity by up to 12 percentage points (pp), precipitating a 44 pp accuracy collapse. We identify and remediate a zombie-weight failure mode in which Adam’s first-moment tensors resurrect pruned synapses, violating the binary sparsity guarantee. An iterative schedule - prune, fine-tune with gradient masking, recompute criticality, and repeat - eliminates gradient staleness at high sparsity. A KL-divergence temporal analysis identifies a redundant simulation timestep, enabling a free 10% theoretical energy reduction without weight modification. On MNIST (60,000 training examples), CQP yields 95.6% accuracy at 90% sparsity versus 93.4% for magnitude pruning (+2.2 pp). A criticality-threshold sweep reveals an empirical criticality cliff: accuracy falls from 87.0% to 14.4% as the threshold reaches tau = 0.9, constituting a quantitative SNN-level analogue of the Critical Brain Hypothesis. Combined weight sparsification and temporal truncation yield a compound 73% reduction in per-inference energy at 70% sparsity, confirming the practical value of the proposed pipeline for neuromorphic deployment.
[LG-60] Explainable Artificial Intelligence For The Detection and Characterisation of Stage B Heart Failure
链接: https://arxiv.org/abs/2606.30665
作者: Ahmed M Salih,Emer Brady,Ranjit Arnold,Gaurav Gulsin,Huiyu Zhouyb,Anvesha Singh,Gerry McCanna
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Stage B heart failure is characterized by asymptomatic structural or functional cardiac abnormalities. Identifying individuals at this stage is clinically important, as early detection may enable targeted interventions to prevent progression to symptomatic disease. Explainable artificial intelligence (XAI) may support early detection, transparent risk stratification, and selection of clinically actionable interventions. This review examines the use of XAI in detecting and characterizing stage B heart failure. A literature search of Web of Science, Scopus, and PubMed was conducted on 27 March 2026. Studies were included if they applied AI with XAI techniques to stage B heart failure. After screening, 20 studies were included. Data on modalities, outcomes, demographic reporting, and XAI methods were extracted and synthesized. SHAP was the most commonly used method, followed by LIME, saliency maps, and Grad-CAM; however, XAI adoption was inconsistent, with some studies relying on limited or ad hoc interpretability approaches. Notably, none compared explanations across sex or ethnic subgroups, despite evidence of subgroup differences in disease burden. Evaluation of XAI outputs was often insufficient: some studies did not assess explanations, while others relied only on literature-based comparisons, introducing potential bias. These limitations suggest explainability was not systematically validated or leveraged to support robust and fair clinical inference. XAI shows promise for improving transparency in stage B heart failure identification, but current implementations remain limited. Key gaps include limited consideration of sex and ethnicity, absence of subgroup-specific analyses, inconsistent evaluation, and lack of external validation, all of which constrain generalisability and clinical adoption.
[LG-61] Random Reshuffling Dominates Stochastic Gradient Descent COLT2026
链接: https://arxiv.org/abs/2606.32005
作者: Zijian Liu
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: COLT 2026
Abstract:Stochastic Gradient Descent ( \textsfSGD ) is one of the most classical optimization algorithms with favorable theoretical guarantees, yet the practical implementation of \textsfSGD differs subtly from its well-known form and is often referred to as Shuffling Stochastic Gradient Descent ( \textsfShuffling SGD ). A particularly popular strategy in \textsfShuffling SGD is Random Reshuffling ( \textsfRR ), which has achieved great empirical success across numerous experiments. Despite its strong performance, \textsfRR has long been considered a heuristic due to a lack of theoretical support. Over the last decade, people have finally established provable convergence rates for \textsfRR , thus justifying its observed superiority. However, for smooth convex optimization, two clouds over the convergence theory of \textsfRR remain to this day. More precisely, according to the current theory, \textsfShuffling SGD under \textsfRR converges only when the stepsize is smaller than a threshold proportional to 1/n , where n is the number of summands in the objective (or the number of data points). Consequently, the optimally tuned theoretical rate of \textsfShuffling SGD under \textsfRR is strictly worse than that of \textsfSGD when the number of epochs is smaller than another threshold proportional to n . These two restrictions heavily limit the applicability of existing theories and leave a critical mismatch with practice. In this work, for the first time, we prove that \textsfRR dominates \textsfSGD in smooth convex optimization under any reasonable stepsize after any finite number of epochs, thereby addressing a longstanding open question.
[LG-62] Accelerating Conformal Prediction via Approximate Leave-One-Out
链接: https://arxiv.org/abs/2606.31915
作者: Jiachen Cong,Jingbo Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:While conformal prediction provides a general framework for uncertainty quantification in predictive inference, its application is often limited by computational cost. Recent methods, including Jackknife+ and Jackknife-minmax, achieve faster computation by trading a slight loss of efficiency relative to full conformal prediction, but still requires computing leave-one-out refits for all observations. In this paper, we further accelerate conformal prediction by incorporating approximate leave-one-out (ALO) estimators, and establish asymptotic coverage and efficiency. While our proof draws on methods developed for analyzing the consistency of ALO cross-validation risk estimators in high-dimensional statistics, it requires adaptations to handle conformal prediction, where leave- i -out residuals are needed for predictions at x_n+1 rather than just at the training covariate x_i . Simulation results validate our theoretical findings, showing that the ALO-based methods achieve coverage and efficiency comparable to the exact methods, while significantly reducing the runtime.
[LG-63] Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation INTERSPEECH26
链接: https://arxiv.org/abs/2606.31729
作者: Dominika Woszczyk,Andreas Triantafyllopoulos,Jura Miniota,Éva Székely,Bjoern Schuller
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注: Accepted at Interspeech 26’
Abstract:Text-to-speech (TTS) evaluation is an open challenge. While the primary target was “naturalness,” recent fidelity gains shifted focus toward “appropriateness” and whether speech is correct for its context. In this work, we examine how perception changes when the expected downstream use varies. We measure the appropriateness and human-likeness of five SOTA TTS systems across five domains: AI assistant, reader, actor, animated character, and spontaneous speaker. Results show appropriateness varies across domains independently of naturalness. While systems shine at reading, expressive domains remain challenging, and optimizing for one can degrade others. Furthermore, naturalness scores tend to penalize stylized speech while rewarding spontaneity. Finally, our study also highlights blind spots in one-size-fits-all evaluation metrics across more expressive domains. We demonstrate that TTS performance is not “solved” but depends on the target domain, requiring context-aware evaluation.
[LG-64] On Optimal Data Splitting for Split Conformal Prediction
链接: https://arxiv.org/abs/2606.31600
作者: Sayan Das,Bahram Yaghooti,Todd A. Kuffner,Soumendra N. Lahiri
类目: atistics Theory (math.ST); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Conformal prediction and its variants, including the split conformal prediction, provide a distribution-free framework for uncertainty quantification by constructing prediction intervals or sets with finite-sample coverage guarantees. The statistical efficiency of these intervals depends critically on how the data are split into training and calibration samples. Despite its practical importance, a principled characterization of the training-calibration split that minimizes prediction interval length while maintaining coverage has remained largely unresolved. In this paper, we develop a theoretical framework for optimal data splitting in split conformal prediction. We first analyze the problem in a general setting and derive analytical characterizations of the length-optimal split ratio under both symmetric and asymmetric regimes. We then show how the general results specialize to several commonly used regression settings, including linear regression, nonparametric regression, and neural networks, thereby demonstrating the scope of the framework. We also describe a data-based method for selecting the optimal proportion. Our analysis clarifies how model-related features govern the optimal allocation of samples between training and calibration and provides principled guidance for constructing shorter prediction intervals. Experiments on both synthetic and real-world datasets demonstrate the applicability of the proposed methodology across a variety of practical scenarios.
[LG-65] Direction-Magnitude Decomposition for Low-Rank Matrix Optimization: Faster Convergence and Saddle-to-saddle Dynamics
链接: https://arxiv.org/abs/2606.31390
作者: Yudong Wei,Liang Zhang,Bingcong Li,Niao He
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:
Abstract:Low-rank matrix optimization is often carried out via the Burer-Monteiro (BM) formulation, but choosing the factorization rank r is delicate and can substantially slow optimization. We propose a unified framework, termed direction-magnitude decomposition (DMD), that decomposes the optimization variable to improve optimization efficiency even when the target rank is unknown. We develop two DMD-based approaches and establish their theoretical advantages on the canonical problem of matrix factorization. The first, overparameterized DMD, uses a rank r larger than necessary and enjoys faster convergence as r increases. The second, recursive DMD, is motivated by the incremental eigenpair learning, or saddle-to-saddle, behavior of overparameterized DMD. It achieves lower memory and computational costs, complementing overparameterized DMD. Both approaches are exponentially faster than gradient descent applied to the BM formulation. Numerical experiments on matrix factorization, sensing, and completion corroborate our theoretical findings and demonstrate the practical effectiveness of DMD.
[LG-66] MNAR-k-means: A k-means Clustering for Data Missing Not at Random with Magnitude-Decaying Probability
链接: https://arxiv.org/abs/2606.31253
作者: Xin Guan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:The classical k -means clustering, based on distances computed from all data features, cannot be directly applied to incomplete data with missing values. A natural extension of k -means to missing data is to involve only the observed positions in clustering, which is equivalent to imputing missing values by corresponding cluster means. However, for data missing not at random (MNAR), since missingness is related to data values, such a mean-imputation-based method may lead to the distortion of estimated cluster centers, resulting in a poor clustering result. Since MNAR mechanisms are very common in reality, it is necessary to improve the performance of k -means-based clustering methods for such data. In this paper, we focus on a magnitude-decaying MNAR scenario where data is more likely to be missing at positions with smaller absolute values, and we propose a novel k -means clustering method based on the constraint of the size of imputation values, which enjoys a good mathematical interpretation. Moreover, we establish the statistical consistency of the estimated cluster centers of the proposed method to the true cluster centers of fully observed data, and solve the optimization of the proposed loss function via an alternative minimization algorithm. Simulation experiments verify the effect of the proposed method in improving clustering results and reducing the bias of estimated cluster centers. Applications to real-world missing data further show the utility of the proposed method.
[LG-67] Scaling Storm-Resolving Atmospheric AI Simulation to the Entire Planet
链接: https://arxiv.org/abs/2606.31248
作者: Zeyuan Hu,Akshay Subramaniam,Noel Keen,Tao Ge,Jaideep Pathak,Mohammad Shoaib Abbas,Suman Ravuri,Karthik Kashinath,Naser Mahfouz,Peter Caldwell,Mike Pritchard,Noah Brenowitz
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 34 pages, 23 figures, 7 tables
Abstract:Kilometer-scale convection shapes precipitation extremes, tropical organization, and cloud feedbacks, but most global atmospheric models approximate these processes at 25-100 km resolution. Global storm-resolving physics models resolve convective systems explicitly, but at a cost – roughly one MWh per simulated day on exascale supercomputers – that limits long-duration simulation. We introduce STRATA (Storm-resolving Tile-based autoRegressive Atmosphere Transformer Architecture), the first autoregressive AI emulator for global storm-resolving atmospheric dynamics. STRATA is trained on the highest-resolution atmospheric dataset yet used for global AI emulation: 17 days of SCREAM physics-model output at 4.9-km resolution (~25 million grid cells) sampled every 10 minutes. Our central premise is that on 10-minute timescales atmospheric dynamics are predominantly local, so training on small spatial tiles trades scarce global temporal samples for abundant local spatial samples and enables global rollout via overlapping-tile blending. STRATA combines 3D patch embedding and local 3D neighborhood attention, a novel Stereographic Rotary Position Embedding (StereoRoPE) for grid-invariant encoding, and a pixel-space de-aliasing decoder that suppresses patch-scale rollout artifacts. An iso-FLOP scaling study reveals that km-scale emulation requires ~10x more FLOPs per grid point than coarse-resolution AI weather models, consistent with the higher information density of convective-scale dynamics. Trained on only 17 days of data, STRATA produces stable 24-hour global rollouts with realistic km-scale dynamics across diverse regimes, though large-scale biases develop with lead time. It achieves 48 simulation days per megawatt-hour – about 50 times better energy efficiency than the SCREAM physics model – and 741 simulated days per wall-clock day at 512 H100 GPUs. Code and dataset are publicly available.
[LG-68] Dynamic Gaussian Processes and the Vanilla-SPDE Exchange
链接: https://arxiv.org/abs/2606.31063
作者: Rui-Yang Zhang,Lachlan Astfalck,Edward Cripps,David Leslie,Henry Moss
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
*备注:
Abstract:Gaussian process inference is often limited by cubic computational costs, a challenge that becomes more pronounced in spatio-temporal settings where posterior inference is required over dense grids. While state-space SPDE formulations enable linear complexity in time, exact inference remains cubic in space and deteriorates further when observation locations are disjoint from the prediction locations, which inflates the number of considered spatial points. To address this, we propose the Vanilla-SPDE Exchange, which exploits an equivalence between the standard and SPDE formulations of GP inference to construct a hybrid scheme with improved computational cost. We demonstrate these gains through complexity analysis and numerical experiments.
[LG-69] Hierarchical Clustering As a Novel Solution to the Notorious Multicollinearity Problem in Observational Causal Inference KDD2023
链接: https://arxiv.org/abs/2606.30992
作者: Yufei Wu,Zhiying Gu,Alex Deng,Jacob Zhu,Linsha Chen
类目: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP)
*备注: Presented at the KDD 2023 Workshop on Causal Inference and Machine Learning in Practice, Long Beach, CA; also presented at the 2023 Joint Statistical Meetings
Abstract:Multicollinearity is a long lasting challenge in observational causal inference, especially in regressions – highly correlated independent variables make it hard to isolate their individual impacts on outcomes of interest. While common solutions such as shrinkage estimators and principal component regressions are helpful in prediction problems, a crucial limitation hinders their applicability to causal inference problems – they cannot provide the original causal relationships. To fill the gap, we present an innovative and intuitive solution, by employing hierarchical clustering to aggregate data in a way that effectively alleviates collinearity. This method is generally applicable to causal problems featuring multicollinearity. We use a marketing application to demonstrate how and why it works. Expenditures on different advertising channels often exhibit correlations, making it exceedingly difficult to separately measure their impact. Many previous studies proposed to leverage granular cross-sectional data for better identification but, to our knowledge, none explicitly addressed multicollinearity, which undermines causal identification even with granular data. We propose to hierarchically cluster geographic units based on marketing spend correlation to reduce collinearity, and to implement a Bayesian Marketing Mix Model with cluster-level data. Such clustering happens in two steps – we first normalize and demean geo-level data to establish a common scale and to eliminate the common trends; we then calculate pairwise distance to summarize marketing spend correlation between geos and cluster the ones with moderate to strong correlation. Both descriptive evidence and regression analysis affirm that such hierarchical clustering effectively mitigates collinearity and facilitates the separate identification of the impact of different marketing channels. Comments: Presented at the KDD 2023 Workshop on Causal Inference and Machine Learning in Practice, Long Beach, CA; also presented at the 2023 Joint Statistical Meetings Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Applications (stat.AP) Cite as: arXiv:2606.30992 [stat.ME] (or arXiv:2606.30992v1 [stat.ME] for this version) https://doi.org/10.48550/arXiv.2606.30992 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-70] ElemeNet: Multiscale Molecular Machine Learning with Uncertainty Quantification Across the Periodic Table
链接: https://arxiv.org/abs/2606.30961
作者: Jacob W. Toney,Samir Darouich,Yiran Wang,Aaron G. Garrison,Johannes Kästner,Heather J. Kulik
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:Advances in deep learning architectures and representations have enabled ML-driven chemical property prediction, but state-of-the-art (SOTA) models have remained largely confined to independent codebases and lack support for diverse chemical species. This work introduces ElemeNet, a unified, general-purpose software package for molecular machine learning. The ElemeNet software package enables the training of advanced ML models for diverse properties and datasets with an enlarged range of elemental compositions. We define molecular representations compatible with elements 1-100, supporting diverse organometallic and biological systems in addition to organic chemistry already well-served by the Chemprop ML toolkit. As well as more common atom-, bond-, and molecule-level predictions, we introduce moiety predictions. We also natively define optional conditioning on charge and spin states. Advanced E(3)-equivariant and transformer architectures are supported, as well as classical 2D models, with all classes including built-in uncertainty quantification through deterministic and statistical measures. We benchmark our protocols for ML model training against representative datasets from organic, inorganic, coordination, and biological chemistry, achieving competitive and SOTA performance relative to literature baselines and favorable scaling to millions of molecules. The entire workflow is exposed through a concise command-line interface, lowering the barrier to entry for non-expert users. We anticipate ElemeNet will empower non-computational researchers to leverage modern deep learning methods across the chemical and physical sciences.
[LG-71] SGD at the Edge of Stability: Stochastic Stabilization with Large Learning Rates
链接: https://arxiv.org/abs/2606.30930
作者: Konstantinos Emmanouilidis,Lachlan MacDonald,Salma Tarmoun,Rene Vidal
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Modern deep learning has been shown to operate at the edge of stability, routinely using learning rates far larger than those justified by classical optimization theory. Most prior analyses of the edge of stability phenomenon focus on deterministic gradient descent, leaving the stochastic setting largely unexplored. In this work, we provide sharp convergence guarantees for Stochastic Gradient Descent (SGD) applied to the multiclass cross-entropy loss, for both linear classifiers and two-layer neural networks. We show that the stochasticity of SGD may cause the dynamics to alternate between an edge-of-stability regime that is dominated by curvature-driven oscillations, and a stable regime in which the expected loss decreases at a controlled rate. Despite that, we prove that SGD self-stabilizes the dynamics, ensuring that the iterates return to stability in a fixed number of iterations and allowing convergence in the best-iterate sense even with large learning rates. Experiments validate our theoretical findings and illustrate the benefits of SGD in the large-stepsize regime.
[LG-72] Conditional Tropical Cyclogenesis Rates via Rare-Event Sampling in a Neural Weather Emulator
链接: https://arxiv.org/abs/2606.30920
作者: John S. Schreck,William Chapman,Charlie Becker,David John Gagne II
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:We couple Forward Flux Sampling (FFS), a non-equilibrium rare-event technique from statistical mechanics, to a neural weather emulator (SDL-WXFormer, 1° grid spacing) to estimate conditional tropical cyclogenesis rates, or how often a tropical cyclone achieves a hurricane-level central pressure, without modifying model dynamics. Tropical cyclogenesis rates vary by orders of magnitude across regimes, yet direct ensemble sampling cannot resolve this variability at operationally feasible ensemble sizes. FFS decomposes the rare disturbance to mature cyclone intensification path into a flux through an initial interface pressure and a product of conditional crossing probabilities across four intermediate interface pressures. We use the 1° emulator because FFS requires O(10^4) model trajectories per initial condition, and because the model’s calibrated stochastic layers provide the necessary exploratory spread. Applied to 98 Atlantic basin initial conditions spanning 21 August - 8 October 2022, FFS resolves genesis rates spanning nearly three orders of magnitude, capturing a seasonal cycle qualitatively consistent with observations. A self-consistency check comparing FFS rates to independent direct-sampling rates yields a mean ratio of 1.03 +/- 0.15 across all initial conditions. Computational enhancement factors range from 3X (most active environment) to 140X (most suppressed), with a geometric mean of 14X. Three case studies illustrate the physical diagnostics the method provides: the rate-limiting step is initial tropical organization for the Earl environment, uniformly high crossing probabilities for the Fiona precursor environment, and a compound barrier at the final intensification stages for the Ian environment. More efficient emulators would enable application of FFS to finer resolutions.
[LG-73] Structure-Regularized Interpretable TCR-Epitope Prediction
链接: https://arxiv.org/abs/2606.30902
作者: Jiarui Li,Zixiang Yin,Yunbei Zhang,Janet Wang,Samuel J. Landry,Zhengming Ding,Ramgopal R. Mettu
类目: Biomolecules (q-bio.BM); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:T cell receptor (TCR)-epitope binding prediction is essential for understanding adaptive immunity and developing immunotherapies. Existing sequence- and structure-based models often generalize poorly to unseen epitopes and provide limited interpretability. Furthermore, the impact of generated structures on model learning remains unclear. We present TCR-SRIM, a structure-regularized interpretable-by-design model that combines protein language model embeddings with interpretable contact prototypes to capture residue-level TCR-epitope interactions. TCR-SRIM achieves state-of-the-art predictive performance and improved interpretation quality on the TCR-XAI benchmark. Using its inherent interpretability, we further evaluate the effect of generated structures on model learning. While structures predicted by AlphaFold3, TCRModel2, and tFold-TCR yield competitive performance, they lead to less accurate interaction patterns and reduced binding-site diversity than experimentally-resolved structures. Our results highlight limitations of current structure prediction models for TCR-epitope learning and demonstrate the value of interpretable-by-design models for studying generated biological structures.
[LG-74] Dynamic Prediction of Alternating Recurrent Events via Neural Network
链接: https://arxiv.org/abs/2606.30889
作者: Abigail Loe,Susan Murry,Zhenke Wu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:Alternating recurrent events – event-times of a specific nature that trigger a secondary refractory period – occur in a wide-range of fields, including behavioral science, criminal justice, and biostatistics. Analysis of these events requires careful attention to the statistical nuance, including correlated observations and repeated outcomes subject to potential censoring. We develop an online dynamic prediction framework appropriate for predicting subsequent alternating recurrent events, by developing neural network theory for a statistical audiences and applying inverse probability weighted pseudo-observations. The proposed model is applied to dynamically predict alternating recurrent event-free time, showing good performance in simulation, and outstanding capability in application to predicting periods of low mood for first-year medical residents. We close with a discussion.
[LG-75] Separation Capacity of Scattering Networks
链接: https://arxiv.org/abs/2606.30822
作者: Konstantin Häberle,Helmut Bölcskei
类目: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG); Complex Variables (math.CV)
*备注: 36 pages, 10 figures
Abstract:In this paper, we attempt to enhance the theoretical understanding of convolutional neural networks (CNNs) as feature extractors in classification tasks by analyzing them through the lens of Cover’s function-counting theory. Specifically, our focus lies on the notion of separation capacity, a combinatorial quantity derived from counting the number of realizable dichotomies (i.e., binary label assignments). Our contributions are threefold. First, we extend Cover’s framework by establishing a conceptually insightful and practically useful formulation for the separation capacity. Second, leveraging this formulation, we identify the factors governing the separation capacity of feature extractors that employ a specific CNN architecture, so-called scattering networks, in terms of their network building blocks. Third, we provide practical insights for scattering network design.
[LG-76] Diffusion-warm sampling of the XY model enables fast thermalization at scale
链接: https://arxiv.org/abs/2606.30773
作者: Sehmimul Hoque,Roger Melko,Pooya Ronagh
类目: Quantum Physics (quant-ph); Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
*备注: 17 pages, 10 figures
Abstract:We introduce a novel technique for scalable sampling of spin-system states with continuous symmetries using diffusion models. By applying our approach to the XY model, a fundamental continuous-spin model in condensed matter physics, we show that our technique addresses the shortfalls of the Markov chain Monte Carlo (MCMC) in generalization to varying system sizes. More specifically, we show that training a temperature-conditioned diffusion model on smaller-size XY model lattices enables the generation of accurate samples in larger lattice sizes. By tracking physically important observables of the model, such as spin correlations, our experiments demonstrate that diffusion sampling followed by a few MCMC steps reduces the thermalization time by an order of magnitude relative to the standard MCMC with random initialization. Our study provides valuable insight as to how generative models can be used to study continuous-state condensed matter systems at scale.
[LG-77] MediEncoder: Nonlinear Representation Learning for High-Dimensional Causal Mediation Analysis
链接: https://arxiv.org/abs/2606.30648
作者: Shi Bo,Debarghya Mukherjee,AmirEmad Ghassami
类目: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注: 43 pages, 3 figures
Abstract:Causal mediation analysis decomposes a treatment effect into indirect pathways through mediators and direct pathways not operating through them. Modern biomedical studies often involve high-dimensional covariates and mediators that are noisy proxies for lower-dimensional latent biological processes. Existing methods typically rely on sparsity, linear factor models, or ignore the connection among variables in the learned representations, which can be restrictive when measurements are nonlinear and covariate and mediator factors are structurally dependent. We propose MediEncoder, a representation-learning framework for nonlinear high-dimensional mediation analysis. MediEncoder jointly learns low-dimensional covariate and mediator representations using a coupled encoder-decoder architecture with a cross-factor network that links treatment and covariate representations to mediator representations. The learned features are then used in a cross-fitted efficient influence function-based estimator of natural direct and indirect effects. The resulting estimator is multiply robust and asymptotically normal under suitable regularity conditions. Simulations show that MediEncoder improves estimation accuracy over competing dimension-reduction approaches, and an application to Alzheimer’s Disease Neuroimaging Initiative data illustrates its utility in high-dimensional biomedical causal mediation analysis.
[LG-78] Analysis of Atomic Charge State and Atomic Number for VAMOS Magnetic Spectrometer using Deep Neural Networks and Fractionally Labelled Events DATE
链接: https://arxiv.org/abs/2507.07109
作者: M. Rejmund,A. Lemasson
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex); Atomic Physics (physics.atom-ph); Data Analysis, Statistics and Probability (physics.data-an)
*备注: Update figures 5 and 6
Abstract:The VAMOS++ magnetic spectrometer is a multi-parametric system that integrates ion optical magnetic elements with a multi-detector stack. The magnetic elements, along with the tracking and timing detectors and the trajectory reconstruction method, provide the analysis of the magnetic rigidity, the trajectory length between the beam interaction point and the focal plane of the spectrometer, and the related velocity and mass-over-charge ratio. The segmented ionization chamber provides the energy measurements necessary to analyze the atomic charge state and atomic number. However, this analysis critically suffers from inherent limitations due to the variable thickness and non-uniformity of the entrance window of the ionization chamber and other detector imperfections. Conventionally, this meticulous, detailed analysis is exceptionally tedious, often requiring several months to complete. We present a novel method utilizing deep neural networks, trained on an experimental dataset with only a small fraction of precisely labeled events for the lowest and best-resolved atomic charge states or numbers. This innovative approach enables the networks to autonomously and accurately classify the remaining events. This method drastically accelerates the acquisition of high-resolution atomic charge state and atomic number spectra, reducing analysis time from months to mere hours. Crucially, by discarding human bias, this approach ensures standardized, optimal, and reproducible results with unprecedented efficiency.
[LG-79] Seven-dimensional Trajectory Reconstruction for VAMOS
链接: https://arxiv.org/abs/2503.18959
作者: M. Rejmund,A. Lemasson
类目: Instrumentation and Detectors (physics.ins-det); Machine Learning (cs.LG); Nuclear Experiment (nucl-ex)
*备注: Accepted for publication in Nucl. Instr. and Methods A
Abstract:The VAMOS++ magnetic spectrometer is characterized by a large angular and momentum acceptance and highly non-linear ion optics properties requiring the use of software ion trajectory reconstruction methods to measure the ion magnetic rigidity and the trajectory length between the beam interaction point and the focal plane of the spectrometer. Standard measurements, involving the use of a thin target and a narrow beam spot, allow the assumption of a point-like beam interaction volume for ion trajectory reconstruction. However, this represents a limitation for the case of large beam spot size or extended gaseous target volume. To overcome this restriction, a seven-dimensional reconstruction method incorporating the reaction position coordinates was developed, making use of artificial deep neural networks. The neural networks were trained on a theoretical dataset generated by standard magnetic ray-tracing code. Future application to a voluminous gas target, necessitating the explicit inclusion of the three-dimensional position of the beam interaction point within the target in the trajectory reconstruction method, is discussed. The performances of the new method are presented along with a comparison of mass resolution obtained with previously reported model for the case of thin-target experimental data.
附件下载


